diff --git a/CHANGELOG.md b/CHANGELOG.md index a418edc011..c98cdb515f 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,25 @@ # NVIDIA CUTLASS Changelog +## [3.6.0](https://github.com/NVIDIA/cutlass/releases/tag/v3.6.0) (2024-10-03) + +- [Hopper structured sparse GEMM](./examples/62_hopper_sparse_gemm/62_hopper_sparse_gemm.cu). + + [FP16](./test/unit/gemm/device/sm90_sparse_gemm_f16_f16_f32_tensor_op_f32.cu) + + [FP8](./test/unit/gemm/device/sm90_sparse_gemm_f8_f8_f32_tensor_op_f32.cu) + + [INT8](./test/unit/gemm/device/sm90_sparse_gemm_s8_s8_s32_tensor_op_s32.cu) + + [TF32](./test/unit/gemm/device/sm90_sparse_gemm_tf32_tf32_f32_tensor_op_f32.cu) +- A refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` [API](./include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp) to bring it in line with `gemm::GemmUniversal`. The 3.x convolution API is no longer considered a beta API. +- [An improved mixed input GEMM](./examples/55_hopper_mixed_dtype_gemm/README.md) and a [lookup table implementation](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode. +- [EVT nodes for Top-K selection and softmax](./include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp) and a [GEMM example using them](./examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu). +- [Programmatic Dependent Launch](./include/cutlass/arch/grid_dependency_control.h) (PDL), which leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding [documentation](./media/docs/dependent_kernel_launch.md); see the usage sketch below. +- [A new debugging tool, synclog](./include/cutlass/arch/synclog.hpp), for dumping out all synchronization events from within a kernel to a file. Please see the [synclog documentation](./media/docs/utilities.md#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details. +- A new TMA-enabled [epilogue](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for grouped GEMM that brings a significant performance improvement, as well as its EVT support. +- A SIMT-enabled pointer-array [epilogue](./include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp). +- A new [Ping-Pong kernel schedule for Grouped GEMM](./include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp) and some other optimizations. +- [A new instantiation strategy for CUTLASS profiler kernels](./python/cutlass_library/sm90_shapes.py) along with [improved documentation for instantiation level in the CUTLASS profiler](./media/docs/profiler.md#instantiating-more-kernels-with-hopper). +- New hardware-accelerated support for comparisons and computations of [`cutlass::bfloat16_t`](./include/cutlass/bfloat16.h). +- Fixed the use of `isnan` on Windows for [`half_t`](./test/unit/core/functional.cu). +- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs! + ## [3.5.1](https://github.com/NVIDIA/cutlass/releases/tag/v3.5.1) (2024-07-25) - [Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code](./examples/cute/tutorial/wgmma_sm90.cu) - [Exposure of L2 `cache_hint`s in TMA copy atoms](./include/cute/arch/copy_sm90_tma.hpp#L48) - Exposure of raster order and tile swizzle extent in [CUTLASS library profiler](./media/docs/profiler.md#GEMM), and [example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu).
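Editorial aside on the PDL changelog item above — a minimal raw-CUDA sketch of the mechanism that CUTLASS wraps in `grid_dependency_control.h`, not CUTLASS's own API. Kernel names, grid shapes, and buffers here are illustrative assumptions; PDL requires CUDA 11.8 or newer and only takes effect when compiled for `sm_90`/`90a`:

```cpp
#include <cuda_runtime.h>

__global__ void producer_kernel(float* out) {
  // ... write results to out ...
  // Signal that the dependent grid may begin launching; any remaining work in this
  // kernel can still overlap with the consumer's prologue.
  cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void consumer_kernel(float const* in, float* out) {
  // Prologue work that does not read `in` overlaps with the tail of the producer.
  // Block until all memory operations of the producer grid are visible.
  cudaGridDependencySynchronize();
  // ... safe to read `in` from here on ...
}

void launch_back_to_back(float* buf, float* out, cudaStream_t stream) {
  producer_kernel<<<132, 256, 0, stream>>>(buf);

  // Opt the second launch into programmatic dependent launch.
  cudaLaunchAttribute attr{};
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = 1;

  cudaLaunchConfig_t config{};
  config.gridDim = 132;
  config.blockDim = 256;
  config.stream = stream;
  config.attrs = &attr;
  config.numAttrs = 1;
  cudaLaunchKernelEx(&config, consumer_kernel, static_cast<float const*>(buf), out);
}
```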
- [TMA store based and EVT supported epilogues](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for [Hopper pointer array batched kernels](./test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_ptr_array.cu). -- A new [`GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels](./include/cutlass/gemm/device/gemm_sparse_universal.h) leveraging 2:4 structured sparsity and [support for LLM friendly tile sizes](./test/unit/gemm/device/gemm_f16n_f16t_f32t_tensor_op_f32_sparse_sm80.cu). +- A new [`GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels](./include/cutlass/gemm/device/gemm_sparse_universal.h) to enable serial and parallel split-k for sparse tensor cores and new tiny tile sizes to better support LLM inference: + + [FP16 TN](./test/unit/gemm/device/gemm_f16t_f16n_f32t_tensor_op_f32_sparse_sm80.cu#L269-L393) and [NT](./test/unit/gemm/device/gemm_f16n_f16t_f32t_tensor_op_f32_sparse_sm80.cu#L269-L411). + + [int8 TN](./test/unit/gemm/device/gemm_s8t_s8n_s32t_tensor_op_s32_sparse_sm80.cu#L264-L452). + + [int4 TN](./test/unit/gemm/device/gemm_s4t_s4n_s32t_tensor_op_s32_sparse_sm80.cu#L264-L452). + + [FP32 TN](./test/unit/gemm/device/gemm_f32t_f32n_f32t_tensor_op_f32_sparse_sm80.cu#L427-L642) and [NT](./test/unit/gemm/device/gemm_f32n_f32t_f32t_tensor_op_f32_sparse_sm80.cu#L427-L456). - [CUDA host adapter](./include/cutlass/cuda_host_adapter.hpp) extensions to support TMA descriptor construction driver APIs. - Inclusion of more [Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler](./python/cutlass_library/generator.py). - Support for residual add (beta != 0) in convolution kernels. +- A new convolution [epilogue](./examples/16_ampere_tensorop_conv2dfprop/ampere_tensorop_conv2dfprop.cu#L269) for CUTLASS 2.x to support non-packed NHWC output. - A refactor of [include files throughout CUTLASS core directories](./include/cutlass/gemm/collective/collective_mma_decl.hpp) to reduce circular dependencies and [tests to guard against them](./test/self_contained_includes/CMakeLists.txt). - [A guide for setting up VSCode to work well with CUTLASS](./media/docs/ide_setup.md) and [expanded code style guide](./media/docs/programming_guidelines.md). - Better support for MSVC as a host compiler. diff --git a/CMakeLists.txt b/CMakeLists.txt index 28e2c2b4c0..9187927b13 100755 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -177,7 +177,6 @@ set(CUTLASS_ENABLE_BENCHMARKS ON CACHE BOOL "Enable CUTLASS Benchmarks") set(CUTLASS_ENABLE_TESTS ${CUTLASS_ENABLE_TESTS_INIT} CACHE BOOL "Enable CUTLASS Tests") set(CUTLASS_ENABLE_GTEST_UNIT_TESTS ${CUTLASS_ENABLE_TESTS} CACHE BOOL "Enable CUTLASS GTest-based Unit Tests") set(CUTLASS_USE_SYSTEM_GOOGLETEST OFF CACHE BOOL "Use system/external installation of GTest") - set(CUTLASS_USE_PACKED_TUPLE ON CACHE BOOL "If ON, make cute::tuple be new standard-layout tuple type; if OFF, use the original cute::tuple implementation that is _not_ standard-layout.") if (CUTLASS_USE_PACKED_TUPLE) list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTE_USE_PACKED_TUPLE=1) @@ -315,6 +314,7 @@ set(CUTLASS_LIBRARY_OPERATIONS "all" CACHE STRING "Comma-delimited list of opera set(CUTLASS_LIBRARY_KERNELS ${CUTLASS_LIBRARY_KERNELS_INIT} CACHE STRING "Comma-delimited list of kernel name filters. If unspecified, only the largest tile size is enabled. If the string 'all' is specified, all kernels are enabled.") set(CUTLASS_LIBRARY_IGNORE_KERNELS "" CACHE STRING "Comma-delimited list of kernels to exclude from build.
This option ONLY takes effect if CUTLASS_LIBRARY_KERNELS is set.") set(CUTLASS_LIBRARY_EXCLUDE_KERNELS "" CACHE STRING "Comma-delimited list of kernels to exclude from build. This option always takes effect, whether or not CUTLASS_LIBRARY_KERNELS is set. It also can exclude kernels from the filter file (see KERNEL_FILTER_FILE).") +set(CUTLASS_LIBRARY_INSTANTIATION_LEVEL "" CACHE STRING "Instantiation level for SM90 kernels. Set to `max` and make sure CUTLASS_LIBRARY_KERNELS is non-empty to stamp all possible kernel configurations.") ################################################################################ @@ -362,6 +362,8 @@ if(CUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES) list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) endif() +set(CUTLASS_SKIP_REDUCTION_INIT OFF CACHE BOOL "Skip initialization of the reduction workspace") + # # NOTE: running with asan and CUDA requires the following environment variable: # @@ -389,6 +391,10 @@ if(CUTLASS_NVCC_EMBED_PTX) list(APPEND CUTLASS_CUDA_CLANG_FLAGS --cuda-include-ptx=all) endif() +if (CUTLASS_SKIP_REDUCTION_INIT) + list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_SKIP_REDUCTION_INIT=1) +endif() + if (CUTLASS_ENABLE_TENSOR_CORE_MMA) list(APPEND CUTLASS_CUDA_FLAGS -DCUTLASS_ENABLE_TENSOR_CORE_MMA=1) endif() @@ -398,6 +404,18 @@ if (CUTLASS_PROFILER_DISABLE_REFERENCE) list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_PROFILER_DISABLE_REFERENCE=1) endif() +if (CUTLASS_ENABLE_GDC_FOR_SM90) + message(STATUS "Grid Dependency Control (GDC) is enabled for SM90 kernels (required for programmatic dependent launches).") + list(APPEND CUTLASS_CUDA_NVCC_FLAGS -DCUTLASS_ENABLE_GDC_FOR_SM90=1) +endif() + +set(CUTLASS_ENABLE_SYNCLOG OFF CACHE BOOL "Enable synchronization event logging for race condition debugging. WARNING: This redefines __syncthreads() and __syncwarp() in all downstream code!") + +if (CUTLASS_ENABLE_SYNCLOG) + set(CMAKE_CUDA_SEPARABLE_COMPILATION ON) + string(APPEND CMAKE_CXX_FLAGS " -DCUTLASS_ENABLE_SYNCLOG=1") + string(APPEND CMAKE_CUDA_FLAGS " -DCUTLASS_ENABLE_SYNCLOG=1") +endif() @@ -926,12 +944,27 @@ function(cutlass_add_executable_tests NAME TARGET) set(TEST_GROUP_NAME ${NAME}) + # To run the tests from an install package with tests enabled, we need to generate test files + # that don't rely on the current directory structure in build. + + set(TEST_NAME c${NAME}) + set(TEST_GEN_DIR ${CMAKE_CURRENT_BINARY_DIR}/ctest/${TEST_NAME}) + file(MAKE_DIRECTORY ${TEST_GEN_DIR}) + + set(TEST_EXE_PATH $) + set(TEST_USE_EXTENDED_FORMAT ON) + configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake" @ONLY) + + set(TEST_EXE_PATH $) + set(TEST_USE_EXTENDED_FORMAT OFF) # ctest does not support extended add_test format.
+ configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.install.cmake.in" @ONLY) + foreach(CMD_OPTIONS_VAR IN LISTS __TEST_COMMAND_OPTIONS) if (CMD_COUNT GREATER 1) - string(TOLOWER "${NAME}_${CMD_OPTIONS_VAR}" TEST_NAME) + string(TOLOWER "${NAME}_${CMD_OPTIONS_VAR}" TESTCASE_NAME) else() - string(TOLOWER "${NAME}" TEST_NAME) + string(TOLOWER "${NAME}" TESTCASE_NAME) endif() # The following rigmarole is needed to deal with spaces and possible quotes in @@ -945,7 +978,7 @@ function(cutlass_add_executable_tests NAME TARGET) separate_arguments(TEST_COMMAND_OPTIONS) add_custom_target( - ${TEST_NAME} + ${TESTCASE_NAME} COMMAND ${CUTLASS_TEST_EXECUTION_ENVIRONMENT} $ ${TEST_COMMAND_OPTIONS} DEPENDS @@ -953,34 +986,20 @@ function(cutlass_add_executable_tests NAME TARGET) ) if (CMD_COUNT GREATER 1) - add_dependencies(${NAME} ${TEST_NAME}) + add_dependencies(${NAME} ${TESTCASE_NAME}) endif() foreach(DEPENDEE ${__DEPENDEES}) - add_dependencies(${DEPENDEE} ${TEST_NAME}) + add_dependencies(${DEPENDEE} ${TESTCASE_NAME}) endforeach() - set(TEST_NAME c${TEST_NAME}) + set(TESTCASE_NAME c${TESTCASE_NAME}) string(CONFIGURE "${_INLINE_PER_TEST_CODE_TEMPLATE}" _TEST_CODE @ONLY) - string(APPEND _INLINE_PER_TEST_CODE "${_TEST_CODE}") + file(APPEND "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake" "${_TEST_CODE}") + file(APPEND "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.install.cmake.in" "${_TEST_CODE}") endforeach() - # To run the tests from an install package with tests enabled, we need to generate test files - # that don't rely on the current directory structure in build. - - set(TEST_NAME c${NAME}) - set(TEST_GEN_DIR ${CMAKE_CURRENT_BINARY_DIR}/ctest/${TEST_NAME}) - file(MAKE_DIRECTORY ${TEST_GEN_DIR}) - - set(TEST_EXE_PATH $) - set(TEST_USE_EXTENDED_FORMAT ON) - configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake" @ONLY) - - set(TEST_EXE_PATH $) - set(TEST_USE_EXTENDED_FORMAT OFF) # ctest does not support extended add_test format. - configure_file("${CUTLASS_CTEST_TEMPLATE_FILE}" "${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.install.cmake.in" @ONLY) - # The following line imports the tests for immediate run via `make test`. include(${TEST_GEN_DIR}/CTestTestfile.${TEST_NAME}.cmake) diff --git a/PUBLICATIONS.md b/PUBLICATIONS.md index 65d1f08e07..ba0ef4cff8 100644 --- a/PUBLICATIONS.md +++ b/PUBLICATIONS.md @@ -2,6 +2,12 @@ ## 2024 +- ["ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference"](https://arxiv.org/abs/2410.21465). Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen. _arXiv_, October 2024. + +- ["FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion"](https://arxiv.org/abs/2406.06858). Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, Xin Liu. _arXiv_, June 2024. + +- ["EVT: Accelerating Deep Learning Training with Epilogue Visitor Tree"](https://dl.acm.org/doi/10.1145/3620666.3651369). Zhaodong Chen, Andrew Kerr, Richard Cai, Jack Kosaian, Haicheng Wu, Yufei Ding, and Yuan Xie. _Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems_, April 2024. + - ["Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level"](https://arxiv.org/abs/2403.04690). Ali Hassani, Wen-Mei Hwu, Humphrey Shi. 
_arXiv_, March 2024. ## 2023 @@ -24,6 +30,8 @@ - ["Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search"](https://arxiv.org/abs/2302.01382). Clemens JS Schaefer, Elfie Guo, Caitlin Stanton, Xiaofan Zhang, Tom Jablin, Navid Lambert-Shirzad, Jian Li, Chiachen Chou, Siddharth Joshi, Yu Emma Wang. _arXiv_, February 2023. +- ["Dynamic N:M Fine-Grained Structured Sparse Attention Mechanism"](https://dl.acm.org/doi/abs/10.1145/3572848.3577500). Zhaodong Chen, Zheng Qu, Yuying Quan, Liu Liu, Yufei Ding, Yuan Xie. _Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming_, February 2023. + - ["Stream-K: Work-centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU"](https://arxiv.org/abs/2301.03598). Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens. _arXiv_, January 2023. ## 2022 diff --git a/README.md b/README.md index 9ac15f4165..efe47872c9 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,8 @@ ![ALT](./media/images/gemm-hierarchy-with-epilogue-no-labels.png "Complete CUDA GEMM decomposition") -# CUTLASS 3.5.1 +# CUTLASS 3.6.0 -_CUTLASS 3.5.1 - July 2024_ +_CUTLASS 3.6.0 - October 2024_ CUTLASS is a collection of CUDA C++ template abstractions for implementing high-performance matrix-matrix multiplication (GEMM) and related computations at all levels @@ -41,48 +41,26 @@ and improves code composability and readability. More documentation specific to In addition to GEMMs, CUTLASS implements high-performance convolution via the implicit GEMM algorithm. Implicit GEMM is the formulation of a convolution operation as a GEMM thereby taking advantage of CUTLASS's modular GEMM pipeline. This allows CUTLASS to build convolutions by reusing highly-optimized GEMM components. - -# What's New in CUTLASS 3.5 - -CUTLASS 3.5.1 is an update to CUTLASS adding: - -- [Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code](./examples/cute/tutorial/wgmma_sm90.cu). - [Exposure of L2 `cache_hint`s in TMA copy atoms](./include/cute/arch/copy_sm90_tma.hpp#L48) - Exposure of raster order and tile swizzle extent in [CUTLASS library profiler](./media/docs/profiler.md#GEMM), and -[example 48](./examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu). - [TMA store based and EVT supported epilogues](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for [Hopper pointer array batched kernels](./test/unit/gemm/device/sm90_gemm_f16_f16_f16_tensor_op_f32_ptr_array.cu). - A new [`GemmSparseUniversal` API for CUTLASS 2.x Ampere kernels](./include/cutlass/gemm/device/gemm_sparse_universal.h) leveraging 2:4 structured sparsity and [support for LLM friendly tile sizes](./test/unit/gemm/device/gemm_f16n_f16t_f32t_tensor_op_f32_sparse_sm80.cu). - [CUDA host adapter](./include/cutlass/cuda_host_adapter.hpp) extensions to support TMA descriptor construction driver APIs. - Inclusion of more [Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler](./python/cutlass_library/generator.py). - Support for residual add (beta != 0) in convolution kernels. - A refactor of [include files throughout CUTLASS core directories](./include/cutlass/gemm/collective/collective_mma_decl.hpp) to reduce circular dependencies and [tests to guard against them](./test/self_contained_includes/CMakeLists.txt).
-- [A guide for setting up VSCode to work well with CUTLASS](./media/docs/ide_setup.md) and [expanded code style guide](./media/docs/programming_guidelines.md). -- Better support for MSVC as a host compiler. -- Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2. -- Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1. -- NOTICE: - + Upcoming CUTLASS 3.6 release will include a breaking refactor to the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. After this, the 3.x convolution API will no longer be considered as a beta API. - + Upcoming CUTLASS 3.6 release will include a breaking refactor to the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs. - -CUTLASS 3.5.0 is an update to CUTLASS adding: - -- Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + [TMA im2col](./include/cute/atom/copy_traits_sm90_im2col.hpp). - + Native implementation in CUTLASS 3.x using CuTe, mirroring the [same design hierarchy as that of GEMMs](./media/docs/gemm_api_3x.md). - + Support for 1D, 2D, and 3D convolutions in a [rank-agnostic fashion](./include/cutlass/conv/convnd_problem_shape.hpp). - + Support for [Fprop](./test/unit/conv/device_3x/fprop/sm90_conv3d_fprop_implicit_gemm_s8_s8_s32_tensorop_s32.cu), [Dgrad](./test/unit/conv/device_3x/dgrad/sm90_conv2d_dgrad_implicit_gemm_f16_f16_f32_tensorop_f16.cu), and [Wgrad](./test/unit/conv/device_3x/wgrad/sm90_conv1d_wgrad_implicit_gemm_f16_f16_f32_tensorop_f16.cu) algorithms. - + [CUTLASS profiler support](./python/cutlass_library/conv3x_emitter.py) for 2D and 3D convolutions implemented via the 3.x API. - + NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until 3.7 release. Your feedback is welcome on the design! -- Support for [Ada (SM89) FP8 tensor cores via the 2.x API](./examples/58_ada_fp8_gemm/ada_fp8_gemm.cu). Requires CUDA 12.4 or newer. -- [Ampere gather/scatter convolution example](./examples/59_ampere_gather_scatter_gemm/README.md) in CuTe and CUTLASS 3.x. - + Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs. - + Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores. -- 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices. -- Updates to CuTe documentation for [`cute::Tensor<>`](./media/docs/cute/03_tensor.md), [MMA atoms](./media/docs/cute/0t_mma_atom.md), and an overhauled [CuTe GEMM tutorial series](./examples/cute/tutorial). -- Extensions to CuTe to support [L2 prefetching](./include/cute/algorithm/prefetch.hpp) and [TMA store+reductions](./include/cute/arch/copy_sm90_tma.hpp#L1337). -- Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17. -- Fixes to greatly reduce build warnings. -- Updates and bugfixes from the community (thanks!) -- CUTLASS 3.5.1 is a minor update to CUTLASS containing small bug fixes and improvements, including fixes for FlashAttention-2 builds. +# What's New in CUTLASS 3.6 + +CUTLASS 3.6.0 is an update to CUTLASS adding: + +- [Hopper structured sparse GEMM](./examples/62_hopper_sparse_gemm/62_hopper_sparse_gemm.cu). 
+ + [FP16](./test/unit/gemm/device/sm90_sparse_gemm_f16_f16_f32_tensor_op_f32.cu) + + [FP8](./test/unit/gemm/device/sm90_sparse_gemm_f8_f8_f32_tensor_op_f32.cu) + + [INT8](./test/unit/gemm/device/sm90_sparse_gemm_s8_s8_s32_tensor_op_s32.cu) + + [TF32](./test/unit/gemm/device/sm90_sparse_gemm_tf32_tf32_f32_tensor_op_f32.cu) +- A refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` [API](./include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp) to bring it in line with `gemm::GemmUniversal`. The 3.x convolution API is no longer considered a beta API. +- [An improved mixed input GEMM](./examples/55_hopper_mixed_dtype_gemm/README.md) and a [lookup table implementation](./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu) for `INT4`x`FP8` scale-only mode. +- [EVT nodes for Top-K selection and softmax](./include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp) and a [GEMM example using them](./examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu). +- [Programmatic Dependent Launch](./include/cutlass/arch/grid_dependency_control.h) (PDL), which leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding [documentation](./media/docs/dependent_kernel_launch.md). +- [A new debugging tool, synclog](./include/cutlass/arch/synclog.hpp), for dumping out all synchronization events from within a kernel to a file. Please see the [synclog documentation](./media/docs/utilities.md#debugging-asynchronous-kernels-with-cutlasss-built-in-synclog-tool) for details. +- A new TMA-enabled [epilogue](./include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp) for grouped GEMM that brings a significant performance improvement, as well as its EVT support. +- A SIMT-enabled pointer-array [epilogue](./include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp). +- A new [Ping-Pong kernel schedule for Grouped GEMM](./include/cutlass/gemm/kernel/sm90_gemm_array_tma_warpspecialized_pingpong.hpp) and some other optimizations. +- [A new instantiation strategy for CUTLASS profiler kernels](./python/cutlass_library/sm90_shapes.py) along with [improved documentation for instantiation level in the CUTLASS profiler](./media/docs/profiler.md#instantiating-more-kernels-with-hopper). +- New hardware-accelerated support for comparisons and computations of [`cutlass::bfloat16_t`](./include/cutlass/bfloat16.h). +- Fixed the use of `isnan` on Windows for [`half_t`](./test/unit/core/functional.cu). Minimum requirements: @@ -101,16 +79,15 @@ Starting from CUTLASS 3.0, CUTLASS removed support for the following: # Performance
-[figure: GEMM peak-performance plots (previous release)]
+[figure: updated GEMM peak-performance plots, NVIDIA H100]
CUTLASS primitives are very efficient. When used to construct device-wide GEMM kernels, they exhibit peak performance comparable to cuBLAS for scalar GEMM -computations. The above figure shows CUTLASS performance relative to cuBLAS -for large matrix dimensions on an [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) (NVIDIA Hopper architecture), -an [NVIDIA L40](https://www.nvidia.com/en-us/data-center/l40/) (NVIDIA Ada architecture), -an [NVIDIA A100](https://www.nvidia.com/en-us/data-center/a100/) (NVIDIA Ampere architecture), -and an [NVIDIA A40](https://www.nvidia.com/en-us/data-center/a40/) (NVIDIA Ampere architecture). -CUTLASS 3.0 was compiled with the [CUDA 12.0 Toolkit](https://developer.nvidia.com/cuda-downloads). +computations. The above figure shows the continual CUTLASS performance improvements +on an [NVIDIA H100](https://www.nvidia.com/en-us/data-center/h100/) (NVIDIA Hopper architecture) since +CUTLASS 3.1. +CUTLASS 3.5.1 was compiled with the [CUDA 12.5u1 Toolkit](https://developer.nvidia.com/cuda-downloads). Tensor Core operations are implemented using CUDA's [mma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-mma) and [wgmma](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions) instructions. @@ -163,7 +140,7 @@ CUTLASS runs successfully on the following NVIDIA GPUs, and it is expected to be In general, PTX code generated for one target architecture can be run on future architectures (i.e., it is forward compatible). However, CUDA 12.0 introduced the concept of "architecture-accelerated features" whose PTX does not have forward compatibility guarantees. Several Hopper PTX instructions fall under this category of architecture-accelerated features, and thus require a `sm_90a` target architecture (note the "a" appended). For more details on this and other architecture-accelerated instructions, please refer to the [CUDA Documentation](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#feature-availability). -The target architecture information is passed on to CUTLASS via the cmake flag `CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100, users are required to build CUTLASS with `90a` as the target architecture. If a user accidentally builds a kernel which uses SM90a features (e.g. Hopper Tensor Core Instructions), using the SM90 target (note the lack of "a"), with either CTK 12 or 11.8, the kernel is expected to fail with a runtime error. +The target architecture information is passed on to CUTLASS via the cmake flag `CUTLASS_NVCC_ARCHS`. In order to maximize performance on Hopper GH100, users are required to build CUTLASS with `90a` as the target architecture. If a user accidentally builds a kernel which uses SM90a features (e.g. Hopper Tensor Core Instructions), using the SM90 target (note the lack of "a"), with either CUDA Toolkit 12 or 11.8, the kernel is expected to fail with a runtime error. ``` cmake .. 
-DCUTLASS_NVCC_ARCHS="90a" @@ -191,6 +168,8 @@ CUTLASS is described in the following documents and the accompanying - [Tile Iterators](./media/docs/tile_iterator_concept.md) - describes C++ concepts for iterating over tiles of matrices in memory - [CUTLASS Profiler](./media/docs/profiler.md) - command-line driven profiling application - [CUTLASS Utilities](./media/docs/utilities.md) - additional templates used to facilitate rapid development +- [Dependent kernel launch](./media/docs/dependent_kernel_launch.md) - describes a new feature in Hopper which allows overlapping dependent +kernels in the same stream, and how it is used in CUTLASS. # Resources We have also described the structure of an efficient GEMM in our talk at the diff --git a/cmake/CTestTestfile.configure.cmake b/cmake/CTestTestfile.configure.cmake index 94394a5000..611b3d181f 100644 --- a/cmake/CTestTestfile.configure.cmake +++ b/cmake/CTestTestfile.configure.cmake @@ -50,5 +50,3 @@ if (DEFINED ENV{CUTLASS_TEST_EXECUTION_ENVIRONMENT}) else() set(_CUTLASS_TEST_EXECUTION_ENVIRONMENT @CUTLASS_TEST_EXECUTION_ENVIRONMENT@) endif() - -@_INLINE_PER_TEST_CODE@ diff --git a/cmake/CTestTestfile.test.configure.cmake b/cmake/CTestTestfile.test.configure.cmake index fa2ceeb9bd..31dba54498 100644 --- a/cmake/CTestTestfile.test.configure.cmake +++ b/cmake/CTestTestfile.test.configure.cmake @@ -30,14 +30,14 @@ if (CUTLASS_USE_EXTENDED_ADD_TEST_FORMAT) # The longform/extended format allows generator expressions to be # expanded properly and is useful in contexts where the files need # to be immediately included into being-processed cmake code. - add_test(NAME @TEST_NAME@ COMMAND ${_CUTLASS_TEST_EXECUTION_ENVIRONMENT} "${TEST_EXE_PATH}" @TEST_COMMAND_OPTIONS@) + add_test(NAME @TESTCASE_NAME@ COMMAND ${_CUTLASS_TEST_EXECUTION_ENVIRONMENT} "${TEST_EXE_PATH}" @TEST_COMMAND_OPTIONS@) else() - add_test(@TEST_NAME@ ${_CUTLASS_TEST_EXECUTION_ENVIRONMENT} "${TEST_EXE_PATH}" @TEST_COMMAND_OPTIONS@) + add_test(@TESTCASE_NAME@ ${_CUTLASS_TEST_EXECUTION_ENVIRONMENT} "${TEST_EXE_PATH}" @TEST_COMMAND_OPTIONS@) endif() if (TEST_EXE_WORKING_DIRECTORY) - set_tests_properties(@TEST_NAME@ PROPERTIES WORKING_DIRECTORY "${TEST_EXE_WORKING_DIRECTORY}") + set_tests_properties(@TESTCASE_NAME@ PROPERTIES WORKING_DIRECTORY "${TEST_EXE_WORKING_DIRECTORY}") endif() -set_tests_properties(@TEST_NAME@ PROPERTIES DISABLED @__DISABLE_TESTS@) +set_tests_properties(@TESTCASE_NAME@ PROPERTIES DISABLED @__DISABLE_TESTS@) diff --git a/cmake/googletest.cmake b/cmake/googletest.cmake index 0350fb2dd1..d220cfadc2 100644 --- a/cmake/googletest.cmake +++ b/cmake/googletest.cmake @@ -34,9 +34,10 @@ if(GOOGLETEST_DIR) set(FETCHCONTENT_SOURCE_DIR_GOOGLETEST ${GOOGLETEST_DIR} CACHE STRING "GoogleTest source directory override") endif() +set(GTEST_REPOSITORY "https://github.com/google/googletest.git" CACHE STRING "GoogleTest repo to fetch") FetchContent_Declare( googletest - GIT_REPOSITORY https://github.com/google/googletest.git + GIT_REPOSITORY ${GTEST_REPOSITORY} GIT_TAG v1.14.0 ) diff --git a/examples/35_gemm_softmax/gemm_softmax.cu b/examples/35_gemm_softmax/gemm_softmax.cu index 27156ea02d..731e37b4d9 100644 --- a/examples/35_gemm_softmax/gemm_softmax.cu +++ b/examples/35_gemm_softmax/gemm_softmax.cu @@ -42,7 +42,8 @@ #include "cutlass/arch/memory.h" #include "cutlass/arch/memory_sm75.h" #include "cutlass/gemm/device/gemm_complex.h" - +#include "cutlass/numeric_types.h" +#include "cutlass/numeric_size.h" #include "cutlass/util/command_line.h" #include "cutlass/util/host_tensor.h" @@
-56,6 +57,7 @@ #include "cutlass/util/reference/host/tensor_fill.h" #include "cutlass/util/reference/host/error_metrics.h" #include "cutlass/util/tensor_view_io.h" +#include "cutlass/numeric_size.h" // cutlass::bits_to_bytes #include "cutlass/layout/matrix.h" #include "cutlass/epilogue/thread/linear_combination.h" @@ -657,7 +659,9 @@ struct Testbed { } int64_t flops = int64_t(options.problem_size.m()) * options.problem_size.n() * options.problem_size.k() * 2; - int64_t bytes = (sizeof(ElementD) * 2 + sizeof(ElementSoftmax)) * options.problem_size.m() * options.problem_size.n(); + int64_t bytes = cutlass::bits_to_bytes( + (cutlass::sizeof_bits<ElementD>::value * 2 + cutlass::sizeof_bits<ElementSoftmax>::value) * + options.problem_size.m() * options.problem_size.n()); double gflops_per_second = double(flops) * kIterations * options.batch_count / double(elapsed_ms / 1000.0f) / double(1.0e9); double gbytes_per_second = double(bytes) * kIterations * options.batch_count / double(elapsed_ms / 1000.0f) / double(1 << 30); diff --git a/examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu b/examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu index f26f4da37d..164c785e01 100644 --- a/examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu +++ b/examples/48_hopper_warp_specialized_gemm/48_hopper_warp_specialized_gemm.cu @@ -303,14 +303,14 @@ bool initialize_block( int bits_input = cutlass::sizeof_bits<Element>::value; if (bits_input == 1) { - scope_max = 2; - scope_min = 0; + scope_max = Element(2); + scope_min = Element(0); } else if (bits_input <= 8) { - scope_max = 2; - scope_min = -2; + scope_max = Element(2); + scope_min = Element(-2); } else { - scope_max = 8; - scope_min = -8; + scope_max = Element(8); + scope_min = Element(-8); } cutlass::reference::device::BlockFillRandomUniform( diff --git a/examples/52_hopper_gather_scatter_fusion/gather_gemm.hpp b/examples/52_hopper_gather_scatter_fusion/gather_gemm.hpp index 57053b0f9a..c71109aa79 100644 --- a/examples/52_hopper_gather_scatter_fusion/gather_gemm.hpp +++ b/examples/52_hopper_gather_scatter_fusion/gather_gemm.hpp @@ -111,7 +111,7 @@ class GemmGather EpilogueTensorStorage epilogue; } tensors; - struct PipelineStorage : cute::aligned_struct<16> { + struct PipelineStorage : cute::aligned_struct<16, _2> { using MainloopPipelineStorage = typename CollectiveMainloop::PipelineStorage; using EpiLoadPipelineStorage = typename CollectiveEpilogue::PipelineStorage; diff --git a/examples/53_hopper_gemm_permute/permute_traits.hpp b/examples/53_hopper_gemm_permute/permute_traits.hpp index 96fcc64cf9..4c5baccac5 100644 --- a/examples/53_hopper_gemm_permute/permute_traits.hpp +++ b/examples/53_hopper_gemm_permute/permute_traits.hpp @@ -50,7 +50,7 @@ struct PermuteTraits {}; using X = Underscore; // Reshape a rank-2 shape into a multidimensional shape. -// Input: +// Input: // shape = (A, B, ...) // target_shape = ((A1, ..., X, ..., Am), (B1, ..., X, ..., Bn), ...) // Output: @@ -76,12 +76,12 @@ reshape(Shape const& shape, TargetShape const& target_shape) // - sub-modes corresponding to the implied multidimensional shape of the source tensor // - strides accounting for the permutation operation being performed template -constexpr auto +constexpr auto make_permute_layout(Layout const& layout) { static_assert(cute::rank(Shape{}) == 3, "Only rank-3 layouts are supported"); if constexpr (Transpose) { // Deal with tensor B by transposing appropriately before and after computing the permute layout.
- // Its CuTe-canonical mode order is [N,K,L], while permute operations expect [row,col,batch]. + // Its CuTe-canonical mode order is [N,K,L], while permute operations expect [row,col,batch]. return select<1,0,2>(make_permute_layout(select<1,0,2>(layout))); } else { @@ -129,23 +129,24 @@ inverse(Permutation const & perm) { template using inverse_t = decltype(inverse(T{})); -// Given a rank-2 layout of tensor that is assumed to have been permuted, +// Given a rank-2 layout of tensor that is assumed to have been permuted, // compute the original rank-2 layout of the tensor prior to the permutation. -// This is needed to form the correct input to the standalone permutation kernel. +// This is needed to form the correct input to the standalone permutation kernel. template -constexpr auto +constexpr auto make_original_layout(Layout const& layout) { static_assert(cute::rank(Shape{}) == 3, "Only rank-3 layouts are supported"); if constexpr (Transpose) { // Deal with tensor B by transposing appropriately before and after computing the permute layout. - // Its CuTe-canonical mode order is [N,K,L], while permute operations expect [row,col,batch]. + // Its CuTe-canonical mode order is [N,K,L], while permute operations expect [row,col,batch]. return select<1,0,2>(make_original_layout(select<1,0,2>(layout))); } else { using ShapeProfile = typename PermuteTraits::ShapeProfile; + auto re_shape = flatten(reshape(layout.shape(), ShapeProfile{})); using IndexOrder = typename PermuteTraits::IndexOrder; + auto orig_shape = transform_leaf(IndexOrder{}, [&](auto i){ return get(re_shape); }); using OrigOrder = conditional_t(), seq<0,1,2>, seq<1,0,2>>; - auto orig_shape = select(flatten(reshape(layout.shape(), ShapeProfile{})), IndexOrder{}); // print("Permuted shape: "); print(reshape(layout.shape(), ShapeProfile{})); print("\n"); // print("Original shape: "); print(orig_shape); print("\n"); return make_ordered_layout(product_each(orig_shape), OrigOrder{}); @@ -202,7 +203,7 @@ struct PermuteTraits> }; template -struct PermuteTraits> +struct PermuteTraits> { static constexpr bool kBatched = true; using ShapeProfile = Shape>, Shape, Shape>; @@ -222,7 +223,7 @@ struct PermuteTraits> }; template -struct PermuteTraits> +struct PermuteTraits> { static constexpr bool kBatched = true; using ShapeProfile = Shape, Shape>, Shape>; diff --git a/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm.cu b/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm.cu new file mode 100644 index 0000000000..9346734aec --- /dev/null +++ b/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm.cu @@ -0,0 +1,657 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. 
Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief Hopper GEMM example with different data types using CUTLASS 3.0 APIs for NVIDIA Hopper architecture + + This example shows how to perform INT4 x BF16 GEMM and scale up the INT4 weight during dequantization. + + The narrower type always passes through the register file. Therefore, in cases where the narrower type is operand B, the collective will implicitly swap + A and B in the main loop. However, as a result of this collective performing implicit swaps, it does not support TMA epilogues. Consequently, it is essential to consider this when constructing the epilogue, + as illustrated in this example. + + Note that in this example, we explicitly swap A and B in order to use TMA epilogues. We do this since TMA epilogues are more performant on problem sizes of interest. + + As an additional optimization, we can reorder the narrow data type tensor such that elements read into register file by the same thread are contiguous in global and shared memory. + This promotes vectorization of shared memory loads and removes additional instructions on the critical path. For example, when MMA is performed in FP8 data type, each thread reads + 4 groups of 2 elements that are logically contiguous in the same row (refer to https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-a for thread-value layout). + If the narrow type is INT4 and tensor is major in K dim, only 8 bits can be read at a time, leading to extra load instructions and suboptimal utilization of shared memory throughput. + If we reorder the data offline to place all 16 elements read by a thread contiguously in memory, a single 64-bit load is sufficient. This reordering is often feasible when the quantized + tensor is static (e.g. weight tensor of a NN layer at inference time). This example demonstrates how such a reordering can be performed and communicated to the kernel when the macro + OPTIMIZE_WEIGHT_LAYOUT is set to 1. + + It is expected that the scale's K dimension be scale_k = ceil_div(problem_k, group_size). + + Scales are always expected to be MN major. This means the fastest changing dimension must be M if A is scaled or N if B is scaled. + + If A is being scaled, the scales must have shape [M, scale_k], while if B is scaled, it must have shape [N, scale_k]. + + The implementation only supports "group-wise" scales. 
However, we can make it work for per-column scales by setting the group's size + equal to the gemm problem K. + + Limitations: + 1) Only supports INT4 x { FP16, BF16 }. The scale type must be the same as the MMA type. Scale with zero-point mode is not supported. + 2) The INT4 weights have additional encoding requirements. + 3) The scales must be MN major. That means if A is scaled, it must be column major, but if B is scaled it must be row major. + 4) The scales must have the same layout and group size. + 5) The group size must be greater than or equal to the tile shape K. + 6) Currently, TMA epilogues cannot be used when the narrow type is the B operand. This limitation arises because the implementation always swaps the + operands to ensure that the narrow type passes through the register file, and TMA epilogues do not currently support implicit swap + transpose operations. + We plan to address this limitation in the future. However, we address this in the example by explicitly swapping and transposing the operands. + + Optimization suggestions: + 1) Use a small tile size, since the register pressure for this GEMM (and RS GEMM in general) is high. + + Examples: + + Runs the mixed input batched gemm (with batch size 2), converting B to the type of A (mode 0) + $ ./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm --m=2048 --n=2048 --k=2048 --l=2 --mode=0 + + Runs the mixed input gemm, and applies a scaling factor to B before mma (mode 1). Applies a vector of scales to the entire + matrix (group size is the same as the gemm k dimension). + $ ./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_bf16_gemm --m=4096 --n=5120 --k=8192 --g=8192 --mode=1 +*/ + +#include <iostream> + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/epilogue/collective/default_epilogue.hpp" +#include "cutlass/epilogue/thread/linear_combination.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/device/tensor_fill.h" +#include "cutlass/util/reference/device/tensor_compare.h" + +#include "helper.h" +#include "unfused_weight_dequantize.hpp" +#include "packed_scale.hpp" +#include "reorder_utils.hpp" + +using namespace cute; + +#define OPTIMIZE_WEIGHT_LAYOUT 1 + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// +using MmaType = cutlass::bfloat16_t; +using QuantType = cutlass::int4b_t; +constexpr int TileShapeK = 128 * 8 / sizeof_bits<MmaType>::value; + +// A matrix configuration +using ElementA = MmaType; // Element type for A matrix operand +using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +// B matrix configuration +using ElementB = QuantType; // Element type for B matrix operand +using
LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes) + +// This example manually swaps and transposes, so keep transpose of input layouts +using LayoutA_Transpose = typename cutlass::layout::LayoutTranspose<LayoutA>::type; +using LayoutB_Transpose = typename cutlass::layout::LayoutTranspose<LayoutB>::type; + +using StrideA = cutlass::detail::TagToStrideA_t<LayoutA>; +using StrideB = cutlass::detail::TagToStrideB_t<LayoutB>; + +#if OPTIMIZE_WEIGHT_LAYOUT +// Define the CuTe layout for reordered quantized tensor B +// LayoutAtomQuant places values that will be read by the same thread in contiguous locations in global memory. +// It specifies the reordering within a single warp's fragment +using LayoutAtomQuant = decltype(compute_memory_reordering_atom<MmaType>()); +using LayoutB_Reordered = decltype(tile_to_shape(LayoutAtomQuant{}, Layout<Shape<int,int,int>, StrideB>{})); +#endif + +using ElementScale = MmaType; +using ElementZero = ElementScale; // only for verify +using LayoutScale = cutlass::layout::RowMajor; + +// C/D matrix configuration +using ElementC = cutlass::half_t; // Element type for C and D matrix operands +using LayoutC = cutlass::layout::RowMajor; // Layout type for C and D matrix operands +constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) + +// D matrix configuration +using ElementD = ElementC; +using LayoutD = LayoutC; +constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value; + +// Core kernel configurations +using ElementAccumulator = float; // Element type for internal accumulation +using ElementCompute = float; // Element type for epilogue computation +using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag +using TileShape = Shape<_128,_128,cute::Int<TileShapeK>>; // Threadblock-level tile size +using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster +using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeMixedInput; // Kernel to launch based on the default setting in the Collective Builder +using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecializedCooperative; +using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto; + +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, + TileShape, ClusterShape, + EpilogueTileType, + ElementAccumulator, ElementAccumulator, + // Transpose layout of D here since we use explicit swap + transpose + // the void type for C tells the builder to allocate 0 smem for the C matrix. + // We can enable this if beta == 0 by changing ElementC to void below. + ElementC, typename cutlass::layout::LayoutTranspose<LayoutC>::type, AlignmentC, + ElementD, typename cutlass::layout::LayoutTranspose<LayoutD>::type, AlignmentD, + EpilogueSchedule // This is the only epi supporting the required swap + transpose. + >::CollectiveOp; + +// =========================================================== MIXED INPUT WITH SCALES =========================================================================== +// The Scale information must get paired with the operand that will be scaled. In this example, B is scaled so we make a tuple of B's information and the scale information.
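+// Editorial note (not part of the original patch): the collective builder below pairs the
+// quantized operand's element type with its scale type via a cute::tuple — plain scale-only
+// dequantization in this example. The INT4 x FP8 variant of this example passes a packed-scale
+// array type in the same slot instead, which selects the lookup-table dequantization path
+// described in that file's header comment.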
+using CollectiveMainloopScaleOnly = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, +#if OPTIMIZE_WEIGHT_LAYOUT + cute::tuple<ElementB, ElementScale>, LayoutB_Reordered, AlignmentB, +#else + cute::tuple<ElementB, ElementScale>, LayoutB_Transpose, AlignmentB, +#endif + ElementA, LayoutA_Transpose, AlignmentA, + ElementAccumulator, + TileShape, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout< + static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage)) + >, + KernelSchedule + >::CollectiveOp; + +using GemmKernelScaleOnly = cutlass::gemm::kernel::GemmUniversal< + Shape<int,int,int,int>, // Indicates ProblemShape + CollectiveMainloopScaleOnly, + CollectiveEpilogue +>; + +using GemmScaleOnly = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelScaleOnly>; + +using StrideC = typename GemmKernelScaleOnly::StrideC; +using StrideD = typename GemmKernelScaleOnly::StrideD; + +using StrideC_ref = cutlass::detail::TagToStrideC_t<LayoutC>; +using StrideD_ref = cutlass::detail::TagToStrideC_t<LayoutD>; + +// +// Data members +// + +/// Initialization +StrideA stride_A; +StrideB stride_B; +StrideC stride_C; +StrideC_ref stride_C_ref; +StrideD stride_D; +StrideD_ref stride_D_ref; +uint64_t seed; + +#if OPTIMIZE_WEIGHT_LAYOUT +LayoutB_Reordered layout_B_reordered; +#endif + +using StrideS = typename CollectiveMainloopScaleOnly::StrideScale; +using StrideS_ref = cutlass::detail::TagToStrideB_t<LayoutScale>; +StrideS stride_S; +StrideS_ref stride_S_ref; + +cutlass::DeviceAllocation<ElementA> block_A; +cutlass::DeviceAllocation<ElementB> block_B; +cutlass::DeviceAllocation<MmaType> block_B_dq; +cutlass::DeviceAllocation<ElementScale> block_scale; +cutlass::DeviceAllocation<ElementZero> block_zero; +cutlass::DeviceAllocation<ElementC> block_C; +cutlass::DeviceAllocation<ElementD> block_D; +cutlass::DeviceAllocation<ElementD> block_ref_D; + +#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +// Command line options parsing +struct Options { + + bool help = false; + + float alpha = 1.0f; + float beta = 0.0f; + int iterations = 10; + int m = 5120, n = 4096, k = 4096; + int g = 128; + int l = 1; + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("l", l); + cmd.get_cmd_line_argument("g", g); + cmd.get_cmd_line_argument("alpha", alpha, 1.f); + cmd.get_cmd_line_argument("beta", beta, 0.f); + cmd.get_cmd_line_argument("iterations", iterations); + } + + /// Prints the usage statement. + std::ostream & print_usage(std::ostream &out) const { + + out << "55_hopper_int4_bf16_gemm\n\n" + << " Hopper INT4 x BF16 mixed-dtype GEMM using a warp-specialized kernel.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m=<int> Sets the M extent of the GEMM\n" + << " --n=<int> Sets the N extent of the GEMM\n" + << " --k=<int> Sets the K extent of the GEMM\n" + << " --l=<int> The number of independent gemm problems with mnk shape\n" + << " --g=<int> The size of each group for the scales.
To broadcast a vector of scales or zeros, set the group size to K.\n" + << " --alpha=<f32> Epilogue scalar alpha\n" + << " --beta=<f32> Epilogue scalar beta\n\n" + << " --iterations=<int> Number of profiling iterations to perform.\n\n"; + + out + << "\n\nExamples:\n\n" + << "$ " << "55_hopper_int4_bf16_gemm" << " --m=1024 --n=512 --k=1024 --g=1024 --l=10 --alpha=2 --beta=0.707 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const + { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k * l; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; + +/// Result structure +struct Result +{ + double avg_runtime_ms = 0.0; + double gflops = 0.0; + cutlass::Status status = cutlass::Status::kSuccess; + cudaError_t error = cudaSuccess; + bool passed = false; + +}; + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template <typename Element> +bool initialize_tensor( + cutlass::DeviceAllocation<Element>& block, + uint64_t seed=2023) { + + double scope_max, scope_min; + int bits_input = cutlass::sizeof_bits<Element>::value; + int bits_output = cutlass::sizeof_bits<ElementD>::value; + + if (bits_input == 1) { + scope_max = 2; + scope_min = 0; + } + else if (bits_input <= 8) { + scope_max = 2; + scope_min = -2; + } + else if (bits_output == 16) { + scope_max = 5; + scope_min = -5; + } + else { + scope_max = 8; + scope_min = -8; + } + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); + + return true; +} + +template <typename Element> +bool initialize_quant_tensor( + cutlass::DeviceAllocation<Element>& block, + uint64_t seed=2023) { + + float scope_min = float(cutlass::platform::numeric_limits<Element>::lowest()); + float scope_max = float(cutlass::platform::numeric_limits<Element>::max()); + + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); + + return true; +} + +template <typename Element> +bool initialize_scale( + cutlass::DeviceAllocation<Element>& block, + Options const& options) { + + float elt_max_f = float(cutlass::platform::numeric_limits<QuantType>::max()); + float const max_dequant_val = 4.f; + float const min_dequant_val = 0.5f; + + float scope_max(max_dequant_val / elt_max_f); + float scope_min(min_dequant_val / elt_max_f); + + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); + return true; +} + +template <typename Element> +bool initialize_zero( + cutlass::DeviceAllocation<Element>& block, + Options const& options) { + std::vector<Element> stage(block.size(), Element(0.0f)); + block.copy_from_host(stage.data()); + return true; +} + +/// Initialize operands to be used in the GEMM and reference GEMM +void initialize(Options const& options) { + + auto shape_B = cute::make_shape(options.n, options.k, options.l); + int const scale_k = (options.k + options.g - 1) / options.g; + stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l)); + stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B); + // Reverse stride here due to swap and transpose + stride_C = cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(options.n, options.m, options.l)); + stride_C_ref =
cutlass::make_cute_packed_stride(StrideC_ref{}, cute::make_shape(options.m, options.n, options.l)); + // Reverse stride here due to swap and transpose + stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(options.n, options.m, options.l)); + stride_D_ref = cutlass::make_cute_packed_stride(StrideD_ref{}, cute::make_shape(options.m, options.n, options.l)); + + auto layout_B = make_layout(shape_B, stride_B); + + auto a_coord = cutlass::make_Coord(options.m * options.l, options.k); + auto b_coord = cutlass::make_Coord(options.k, options.n * options.l); + auto c_coord = cutlass::make_Coord(options.m * options.l, options.n); + + block_A.reset(a_coord.product()); + block_B.reset(b_coord.product()); + block_B_dq.reset(b_coord.product()); + block_C.reset(c_coord.product()); + block_D.reset(c_coord.product()); + block_ref_D.reset(c_coord.product()); + + block_scale.reset(scale_k * options.l * options.n); + block_zero.reset(scale_k * options.l * options.n); + + initialize_tensor(block_A, seed + 2022); + initialize_quant_tensor(block_B, seed + 2021); + initialize_tensor(block_C, seed + 2020); + initialize_scale(block_scale, options); + initialize_zero(block_zero, options); + + auto shape_scale_zero = cute::make_shape(options.n, scale_k, options.l); + stride_S = cutlass::make_cute_packed_stride(StrideS{}, cute::make_shape(options.n, scale_k, options.l)); + stride_S_ref = cutlass::make_cute_packed_stride(StrideS_ref{}, cute::make_shape(options.n, scale_k, options.l)); + auto layout_scale_zero = make_layout(shape_scale_zero, stride_S_ref); + + dequantize_weight(block_B_dq.get(), block_B.get(), layout_B, block_scale.get(), block_zero.get(), layout_scale_zero, options.g); + +#if OPTIMIZE_WEIGHT_LAYOUT + // Repeat the reorder layout atom to tile the whole tensor shape + layout_B_reordered = tile_to_shape(LayoutAtomQuant{}, shape_B); + reorder_tensor(block_B.get(), layout_B, layout_B_reordered); + + print("Quantized tensor layout: "); + print(layout_B_reordered); + print("\n"); +#endif +} + +/// Populates a Gemm::Arguments structure from the given commandline options +template <class Args> +Args args_from_options(Options const& options) +{ +// Swap the A and B tensors, as well as problem shapes here.
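+ // Editorial note: because A and B are swapped, the kernel computes D^T = B^T * A^T.
+ // That is why the problem shape below is {n, m, k, l} and the C/D strides were built
+ // with (n, m, l) above, while the reference GEMM keeps the unswapped (m, n, l) strides.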
+ + return Args { + cutlass::gemm::GemmUniversalMode::kGemm, + {options.n, options.m, options.k, options.l}, +#if OPTIMIZE_WEIGHT_LAYOUT + {block_B.get(), layout_B_reordered, block_A.get(), stride_A, block_scale.get(), stride_S, options.g}, +#else + {block_B.get(), stride_B, block_A.get(), stride_A, block_scale.get(), stride_S, options.g}, +#endif + {{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D} + }; +} + +bool verify(Options const& options) { + // + // Compute reference output + // + + using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + MmaType, LayoutA, AlignmentA, + MmaType, LayoutB, AlignmentB, + ElementAccumulator, + TileShape, ClusterShape, + cutlass::gemm::collective::StageCountAuto, + cutlass::gemm::collective::KernelScheduleAuto + >::CollectiveOp; + + using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder< + cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, + TileShape, ClusterShape, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementAccumulator, + ElementC, LayoutC, AlignmentC, + ElementD, LayoutD, AlignmentD, + cutlass::epilogue::NoSmemWarpSpecialized + >::CollectiveOp; + + using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal< + Shape, // Indicates ProblemShape + CollectiveMainloopRef, + CollectiveEpilogueRef + >; + + using GemmRef = cutlass::gemm::device::GemmUniversalAdapter; + + typename GemmRef::Arguments arguments{ + cutlass::gemm::GemmUniversalMode::kGemm, + {options.m, options.n, options.k, options.l}, + {block_A.get(), stride_A, block_B_dq.get(), stride_B}, + {{options.alpha, options.beta}, block_C.get(), stride_C_ref, block_ref_D.get(), stride_D_ref} + }; + + // Run the gemm where the scaling is performed outside of the kernel. + GemmRef gemm_ref; + size_t workspace_size = GemmRef::get_workspace_size(arguments); + cutlass::device_memory::allocation workspace(workspace_size); + CUTLASS_CHECK(gemm_ref.can_implement(arguments)); + CUTLASS_CHECK(gemm_ref.initialize(arguments, workspace.get())); + CUTLASS_CHECK(gemm_ref.run()); + + // compare_reference + ElementD const epsilon(1e-2f); + ElementD const non_zero_floor(1e-4f); + bool passed = cutlass::reference::device::BlockCompareRelativelyEqual(block_ref_D.get(), block_D.get(), block_D.size(), epsilon, non_zero_floor); + + return passed; +} + +/// Execute a given example GEMM computation +template +int run(Options &options) +{ + initialize(options); + + // Instantiate CUTLASS kernel depending on templates + Gemm gemm; + + // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm + auto arguments = args_from_options(options); + + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + + // Allocate workspace memory + cutlass::device_memory::allocation workspace(workspace_size); + + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + + // Correctness / Warmup iteration + CUTLASS_CHECK(gemm.run()); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + Result result; + result.passed = verify(options); + + std::cout << " Disposition: " << (result.passed ? 
"Passed" : "Failed") << std::endl; + + if (!result.passed) { + exit(-1); + } + + // Run profiling loop + if (options.iterations > 0) + { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + CUTLASS_CHECK(gemm.run()); + } + timer.stop(); + + // Compute average runtime and GFLOPs. + float elapsed_ms = timer.elapsed_millis(); + result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations); + result.gflops = options.gflops(result.avg_runtime_ms / 1000.0); + + std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.l << std::endl; + std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl; + std::cout << " GFLOPS: " << result.gflops << std::endl; + } + + return 0; +} + +#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +int main(int argc, char const **args) { + + // CUTLASS must be compiled with CUDA 12.0 Toolkit to run this example + // and must have compute capability at least 90. + if (__CUDACC_VER_MAJOR__ < 12) { + std::cerr << "This example requires CUDA 12 or newer.\n"; + // Returning zero so this test passes on older Toolkits. Its actions are no-op. + return 0; + } + + cudaDeviceProp props; + int current_device_id; + CUDA_CHECK(cudaGetDevice(¤t_device_id)); + CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); + cudaError_t error = cudaGetDeviceProperties(&props, 0); + if (props.major < 9) { + std::cerr + << "This example requires a GPU of NVIDIA's Hopper Architecture or " + << "later (compute capability 90 or greater).\n"; + return 0; + } + // {$nv-internal-release begin} + else if (props.major != 9 || props.minor != 0) { + std::cerr << "This example requires a GPU of NVIDIA's Hopper Architecture (compute capability 90).\n"; + return 0; + } + // {$nv-internal-release end} + + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + + // + // Evaluate CUTLASS kernels + // + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + if (options.g == options.k) { + std::cout << "Running in per-column scale mode." << std::endl; + } else { + std::cout << "Running in group scale mode." << std::endl; + } + run(options); +#endif + + return 0; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// \ No newline at end of file diff --git a/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu b/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu new file mode 100644 index 0000000000..eee10e01f2 --- /dev/null +++ b/examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm.cu @@ -0,0 +1,744 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. 
Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+/*! \file
+    \brief Hopper GEMM example with different data types using CUTLASS 3.0 APIs for the NVIDIA Hopper architecture
+
+    This example shows how to perform an INT4 x FP8 GEMM and scale up the INT4 weight during dequantization. It uses a look-up table to avoid the multiplications
+    between INT4 and FP8. To trigger this method, use cutlass::Array<ElementScale, 8> as the scale type in the collective's arguments.
+
+    However, this algorithm requires changes to the encoding of INT4 weights and scale factors. These changes must happen before launching the GEMM. See the helper functions
+    `unify_quant_encoding`, `initialize_packed_scale`, and the header `packed_scale.hpp` for details.
+
+    In a nutshell, the positive values of INT4 weights need to be encoded in the same way as negative values except for the sign bit. For each scale factor,
+    8 negative results (-8 x scale, -7 x scale, ... -1 x scale) are packed together, forming a cutlass::Array<ElementScale, 8> value.
+
+    The narrower type always passes through the register file. Therefore, in cases where the narrower type is operand B, the collective will implicitly swap
+    A and B in the main loop. However, because this collective performs implicit swaps, it does not support TMA epilogues. Consequently, it is essential to consider this when constructing the epilogue,
+    as illustrated in this example.
+
+    Note that in this example, we explicitly swap A and B in order to use TMA epilogues. We do this since TMA epilogues are more performant on problem sizes of interest.
+
+    As an additional optimization, we can reorder the narrow data type tensor such that elements read into the register file by the same thread are contiguous in global and shared memory.
+    This promotes vectorization of shared memory loads and removes additional instructions on the critical path. For example, when the MMA is performed in the FP8 data type, each thread reads
+    4 groups of 4 elements that are logically contiguous in the same row (refer to https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n32-a for the thread-value layout).
+    If the narrow type is INT4 and the tensor is K-major, only 16 bits can be read at a time, leading to extra load instructions and suboptimal utilization of shared memory throughput.
+    If we reorder the data offline so that all 16 elements read by a thread are contiguous in memory, a single 64-bit load is sufficient. This reordering is often feasible when the quantized
+    tensor is static (e.g. the weight tensor of an NN layer at inference time). This example demonstrates how such a reordering can be performed and communicated to the kernel when the macro
+    OPTIMIZE_WEIGHT_LAYOUT is set to 1.
+
+    The scale's K dimension is expected to be scale_k = ceil_div(problem_k, group_size).
+
+    Scales are always expected to be MN major. This means the fastest changing dimension must be M if A is scaled, or N if B is scaled.
+
+    If A is being scaled, the scales must have shape [M, scale_k], while if B is scaled, they must have shape [N, scale_k].
+
+    The implementation only supports "group-wise" scales. However, we can make it work for per-column scales by setting the group size
+    equal to the gemm problem K.
+
+    Limitations:
+      1) Only supports INT4 x { FP8, INT8, UINT8 }. The scale type must be the same as the MMA type. Scale-with-zero-point mode is not supported.
+      2) The INT4 weights and scale factors have additional encoding requirements.
+      3) The scales must be MN major. That means if A is scaled, they must be column major, but if B is scaled they must be row major.
+      4) The scales must have the same layout and group size.
+      5) The group size must be greater than or equal to the tile shape K.
+      6) Currently, TMA epilogues cannot be used when the narrow type is the B operand. This limitation arises because the implementation always swaps the
+         operands to ensure that the narrow type passes through the register file, and TMA epilogues do not currently support implicit swap + transpose operations.
+         We plan to address this limitation in the future. However, this example works around it by explicitly swapping and transposing the operands.
+
+    Optimizing suggestions:
+      1) Use a small tile size, since register pressure for this GEMM (and RS GEMMs in general) is high.
+
+    Examples:
+
+    Runs the mixed input batched gemm (with batch size 2), scaling B with the default group size of 128
+      $ ./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm --m=2048 --n=2048 --k=2048 --l=2
+
+    Runs the mixed input gemm, and applies a scaling factor to B before the mma. Applies a vector of scales to the entire
+    matrix (group size is the same as the gemm K dimension).
+      $ ./examples/55_hopper_mixed_dtype_gemm/55_hopper_int4_fp8_gemm --m=4096 --n=5120 --k=8192 --g=8192
+*/
+
+#include <iostream>
+
+#include "cutlass/cutlass.h"
+
+#include "cute/tensor.hpp"
+#include "cutlass/tensor_ref.h"
+#include "cutlass/epilogue/collective/default_epilogue.hpp"
+#include "cutlass/epilogue/thread/linear_combination.h"
+#include "cutlass/gemm/dispatch_policy.hpp"
+#include "cutlass/gemm/collective/collective_builder.hpp"
+#include "cutlass/epilogue/collective/collective_builder.hpp"
+#include "cutlass/gemm/device/gemm_universal_adapter.h"
+#include "cutlass/gemm/kernel/gemm_universal.hpp"
+
+#include "cutlass/util/command_line.h"
+#include "cutlass/util/distribution.h"
+#include "cutlass/util/host_tensor.h"
+#include "cutlass/util/packed_stride.hpp"
+#include "cutlass/util/tensor_view_io.h"
+#include "cutlass/util/reference/device/tensor_fill.h"
+#include "cutlass/util/reference/device/tensor_compare.h"
+
+#include "helper.h"
+#include "unfused_weight_dequantize.hpp"
+#include "packed_scale.hpp"
+#include "reorder_utils.hpp"
+
+using namespace cute;
+
+#define OPTIMIZE_WEIGHT_LAYOUT 1
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM kernel configurations
+/////////////////////////////////////////////////////////////////////////////////////////////////
+using MmaType = cutlass::float_e4m3_t;
+using QuantType = cutlass::int4b_t;
+constexpr int TileShapeK = 128 * 8 / sizeof_bits<MmaType>::value;
+
+// A matrix configuration
+using ElementA = MmaType;                                               // Element type for A matrix operand
+using LayoutA = cutlass::layout::RowMajor;                              // Layout type for A matrix operand
+constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
+
+// B matrix configuration
+using ElementB = QuantType;                                             // Element type for B matrix operand
+using LayoutB = cutlass::layout::ColumnMajor;                           // Layout type for B matrix operand
+constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
+
+// This example manually swaps and transposes, so keep the transpose of the input layouts
+using LayoutA_Transpose = typename cutlass::layout::LayoutTranspose<LayoutA>::type;
+using LayoutB_Transpose = typename cutlass::layout::LayoutTranspose<LayoutB>::type;
+
+using StrideA = cutlass::detail::TagToStrideA_t<LayoutA>;
+using StrideB = cutlass::detail::TagToStrideB_t<LayoutB>;
+
+#if OPTIMIZE_WEIGHT_LAYOUT
+// Define the CuTe layout for the reordered quantized tensor B.
+// LayoutAtomQuant places values that will be read by the same thread in contiguous locations in global memory.
+// It specifies the reordering within a single warp's fragment.
+using LayoutAtomQuant = decltype(compute_memory_reordering_atom<MmaType>());
+using LayoutB_Reordered = decltype(tile_to_shape(LayoutAtomQuant{}, Layout<Shape<int,int,int>, StrideB>{}));
+#endif
+
+using ElementScale = MmaType;
+using ElementZero = ElementScale; // only used for verification
+using LayoutScale = cutlass::layout::RowMajor;
+
+// C/D matrix configuration
+using ElementC = cutlass::half_t;                                       // Element type for C and D matrix operands
+using LayoutC = cutlass::layout::RowMajor;                              // Layout type for C and D matrix operands
+constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
+
+// D matrix configuration
+using ElementD = ElementC;
+using LayoutD = LayoutC;
+constexpr int AlignmentD = 128 / cutlass::sizeof_bits<ElementD>::value;
+
+// Core kernel configurations
+using ElementAccumulator = float;                         // Element type for internal accumulation
+using ElementCompute = float;                             // Element type for epilogue computation
+using ArchTag = cutlass::arch::Sm90;                      // Tag indicating the minimum SM that supports the intended feature
+using OperatorClass = cutlass::arch::OpClassTensorOp;     // Operator class tag
+using TileShape = Shape<_128,_128,cute::Int<TileShapeK>>; // Threadblock-level tile size
+using ClusterShape = Shape<_1,_1,_1>;                     // Shape of the threadblocks in a cluster
+using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeMixedInput; // Kernel to launch based on the default setting in the Collective Builder
+using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecializedCooperative;
+using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto;
+
+using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
+    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
+    TileShape, ClusterShape,
+    EpilogueTileType,
+    ElementAccumulator, ElementAccumulator,
+    // Transpose the layout of D here since we use explicit swap + transpose.
+    // The void type for C tells the builder to allocate 0 smem for the C matrix.
+    // We can enable this if beta == 0 by changing ElementC to void below.
+    ElementC, typename cutlass::layout::LayoutTranspose<LayoutC>::type, AlignmentC,
+    ElementD, typename cutlass::layout::LayoutTranspose<LayoutD>::type, AlignmentD,
+    EpilogueSchedule // This is the only epilogue supporting the required swap + transpose.
+  >::CollectiveOp;
+
+// =========================================================== MIXED INPUT WITH SCALES ===========================================================================
+// The scale information must get paired with the operand that will be scaled. In this example, B is scaled, so we make a tuple of B's information and the scale information.
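+// In the builder below, the scaled operand's element type becomes cute::tuple<ElementB, cutlass::Array<ElementScale, 8>>;
+// using cutlass::Array<ElementScale, 8> (instead of a plain scalar scale type) is what selects the lookup-table
+// dequantization path described at the top of this file.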
+using CollectiveMainloopScaleOnly = typename cutlass::gemm::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+#if OPTIMIZE_WEIGHT_LAYOUT
+    cute::tuple<ElementB, cutlass::Array<ElementScale, 8>>, LayoutB_Reordered, AlignmentB,
+#else
+    cute::tuple<ElementB, cutlass::Array<ElementScale, 8>>, LayoutB_Transpose, AlignmentB,
+#endif
+    ElementA, LayoutA_Transpose, AlignmentA,
+    ElementAccumulator,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))
+    >,
+    KernelSchedule
+  >::CollectiveOp;
+
+using GemmKernelScaleOnly = cutlass::gemm::kernel::GemmUniversal<
+    Shape<int,int,int,int>, // Indicates ProblemShape
+    CollectiveMainloopScaleOnly,
+    CollectiveEpilogue
+>;
+
+using GemmScaleOnly = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelScaleOnly>;
+
+using StrideC = typename GemmKernelScaleOnly::StrideC;
+using StrideD = typename GemmKernelScaleOnly::StrideD;
+
+using StrideC_ref = cutlass::detail::TagToStrideC_t<LayoutC>;
+using StrideD_ref = cutlass::detail::TagToStrideC_t<LayoutD>;
+
+//
+// Data members
+//
+
+/// Initialization
+StrideA stride_A;
+StrideB stride_B;
+StrideC stride_C;
+StrideC_ref stride_C_ref;
+StrideD stride_D;
+StrideD_ref stride_D_ref;
+uint64_t seed;
+
+#if OPTIMIZE_WEIGHT_LAYOUT
+LayoutB_Reordered layout_B_reordered;
+#endif
+
+using StrideS = typename CollectiveMainloopScaleOnly::StrideScale;
+using StrideS_ref = cutlass::detail::TagToStrideB_t<LayoutScale>;
+StrideS stride_S;
+StrideS_ref stride_S_ref;
+
+cutlass::DeviceAllocation<ElementA> block_A;
+cutlass::DeviceAllocation<ElementB> block_B;
+cutlass::DeviceAllocation<ElementB> block_B_modified;
+cutlass::DeviceAllocation<MmaType> block_B_dq;
+cutlass::DeviceAllocation<ElementScale> block_scale;
+cutlass::DeviceAllocation<cutlass::Array<ElementScale, 8>> block_scale_packed;
+cutlass::DeviceAllocation<ElementZero> block_zero;
+cutlass::DeviceAllocation<ElementC> block_C;
+cutlass::DeviceAllocation<ElementD> block_D;
+cutlass::DeviceAllocation<ElementD> block_ref_D;
+
+#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// Testbed utility types
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+// Command line options parsing
+struct Options {
+
+  bool help = false;
+
+  float alpha = 1.0f;
+  float beta = 0.0f;
+  int iterations = 10;
+  int m = 5120, n = 4096, k = 4096;
+  int g = 128;
+  int l = 1;
+
+  // Parses the command line
+  void parse(int argc, char const **args) {
+    cutlass::CommandLine cmd(argc, args);
+
+    if (cmd.check_cmd_line_flag("help")) {
+      help = true;
+      return;
+    }
+
+    cmd.get_cmd_line_argument("m", m);
+    cmd.get_cmd_line_argument("n", n);
+    cmd.get_cmd_line_argument("k", k);
+    cmd.get_cmd_line_argument("l", l);
+    cmd.get_cmd_line_argument("g", g);
+    cmd.get_cmd_line_argument("alpha", alpha, 1.f);
+    cmd.get_cmd_line_argument("beta", beta, 0.f);
+    cmd.get_cmd_line_argument("iterations", iterations);
+  }
+
+  /// Prints the usage statement.
+  std::ostream & print_usage(std::ostream &out) const {
+
+    out << "55_hopper_int4_fp8_gemm\n\n"
+        << "  Hopper INT4 x FP8 GEMM using a Warp Specialized kernel.\n\n"
+        << "Options:\n\n"
+        << "  --help                      If specified, displays this usage statement\n\n"
+        << "  --m=<int>                   Sets the M extent of the GEMM\n"
+        << "  --n=<int>                   Sets the N extent of the GEMM\n"
+        << "  --k=<int>                   Sets the K extent of the GEMM\n"
+        << "  --l=<int>                   The number of independent gemm problems with mnk shape\n"
+        << "  --g=<int>                   The size of each group for the scales.
To broadcast a vector of scales or zeros, set the group size to K.\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n\n" + << " --iterations= Number of profiling iterations to perform.\n\n"; + + out + << "\n\nExamples:\n\n" + << "$ " << "55_hopper_warp_specialized_gemm" << " --m=1024 --n=512 --k=1024 -g 0 --l=10 --alpha=2 --mode=2 --beta=0.707 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const + { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k * l; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } +}; + +/// Result structure +struct Result +{ + double avg_runtime_ms = 0.0; + double gflops = 0.0; + cutlass::Status status = cutlass::Status::kSuccess; + cudaError_t error = cudaSuccess; + bool passed = false; + +}; + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +bool initialize_tensor( + cutlass::DeviceAllocation& block, + uint64_t seed=2023) { + + double scope_max, scope_min; + int bits_input = cutlass::sizeof_bits::value; + int bits_output = cutlass::sizeof_bits::value; + + if (bits_input == 1) { + scope_max = 2; + scope_min = 0; + } + else if (bits_input <= 8) { + scope_max = 2; + scope_min = -2; + } + else if (bits_output == 16) { + scope_max = 5; + scope_min = -5; + } + else { + scope_max = 8; + scope_min = -8; + } + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); + + return true; +} + +template +bool initialize_quant_tensor( + cutlass::DeviceAllocation& block, + uint64_t seed=2023) { + + float scope_min = float(cutlass::platform::numeric_limits::lowest()); + float scope_max = float(cutlass::platform::numeric_limits::max()); + + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); + + return true; +} + +// In the mainloop, PRMT selects 1 byte from only 8 bytes so the sign bit is handled in an extra PRMT. +// Here the encodings of positive values and negative values are unified (except for the sign bit). +// For instance, 1 becomes 0b0111, which is the same encoding as -1 (0b1111). 
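+// Concretely, the mapping applied by the switch statement below is (sign bit unchanged):
+//   +1 -> 0b0111   +2 -> 0b0110   +3 -> 0b0101   +4 -> 0b0100
+//   +5 -> 0b0011   +6 -> 0b0010   +7 -> 0b0001
+// i.e. a positive value +x is re-encoded as the two's complement pattern of -x with the sign bit cleared,
+// while 0 and the negative values keep their standard two's complement encoding.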
+bool unify_quant_encoding(
+    cutlass::DeviceAllocation<QuantType> const& block_in,
+    cutlass::DeviceAllocation<QuantType>& block_out) {
+
+  using StorageType = cutlass::int4b_t::Storage;
+
+  if (block_in.size() != block_out.size()) {
+    std::cerr << "block_in and block_out must have the same size.\n";
+    return false;
+  }
+  constexpr int pack = sizeof_bits_v<StorageType> / 4;
+  std::vector<StorageType> data(block_in.size() / pack);
+  cutlass::device_memory::copy_to_host(data.data(), (StorageType*)block_in.get(), block_in.size() / pack);
+
+  for (auto&& d : data) {
+    StorageType out = 0;
+    StorageType mask = 0x0f;
+    for (int i = 0; i < pack; ++i) {
+      cutlass::int4b_t curr;
+      curr.storage = (d >> (i * 4)) & 0x0f;
+      switch (curr) {
+        case 1: curr.storage = StorageType(0b0111); break; // 2's complement
+        case 2: curr.storage = StorageType(0b0110); break; // 2's complement
+        case 3: curr.storage = StorageType(0b0101); break; // 2's complement
+        case 4: curr.storage = StorageType(0b0100); break; // 2's complement
+        case 5: curr.storage = StorageType(0b0011); break; // 2's complement
+        case 6: curr.storage = StorageType(0b0010); break; // 2's complement
+        case 7: curr.storage = StorageType(0b0001); break; // 2's complement
+        default: break;
+      }
+      out |= (curr.storage << (4 * i)) & mask;
+      mask <<= 4;
+    }
+    d = out;
+  }
+
+  cutlass::device_memory::copy_to_device((StorageType*)block_out.get(), data.data(), block_out.size() / pack);
+  return true;
+}
+
+template <typename Element>
+bool initialize_scale(
+  cutlass::DeviceAllocation<Element>& block,
+  Options const& options) {
+
+  float elt_max_f = float(cutlass::platform::numeric_limits<Element>::max());
+  float const max_dequant_val = 4.f;
+  float const min_dequant_val = 0.5f;
+
+  float scope_max(max_dequant_val / elt_max_f);
+  float scope_min(min_dequant_val / elt_max_f);
+
+  cutlass::reference::device::BlockFillRandomUniform(
+    block.get(), block.size(), seed, Element(scope_max), Element(scope_min));
+  return true;
+}
+
+bool initialize_packed_scale(
+  cutlass::DeviceAllocation<ElementScale> const& block_in,
+  cutlass::DeviceAllocation<cutlass::Array<ElementScale, 8>>& block_out) {
+
+  std::vector<ElementScale> data_in(block_in.size());
+  std::vector<cutlass::Array<ElementScale, 8>> data_out(block_in.size());
+  try {
+    block_in.copy_to_host(data_in.data());
+  } catch (cutlass::cuda_exception const& e)
+  {
+    std::cerr << "CUDA Error: " << cudaGetErrorString(e.cudaError()) << std::endl;
+    return false;
+  }
+  for (size_t i = 0; i < block_in.size(); ++i)
+  {
+    cutlass::packed_scale_t<ElementScale> tmp(data_in[i]);
+    data_out[i] = reinterpret_cast<cutlass::Array<ElementScale, 8> const&>(tmp);
+  }
+  try {
+    block_out.copy_from_host(data_out.data());
+  } catch (cutlass::cuda_exception const& e)
+  {
+    std::cerr << "CUDA Error: " << cudaGetErrorString(e.cudaError()) << std::endl;
+    return false;
+  }
+  return true;
+}
+
+template <typename Element>
+bool initialize_zero(
+  cutlass::DeviceAllocation<Element>& block,
+  Options const& options) {
+  std::vector<Element> stage(block.size(), Element(0.0f));
+  block.copy_from_host(stage.data());
+  return true;
+}
+
+/// Initialize operands to be used in the GEMM and reference GEMM
+void initialize(Options const& options) {
+
+  auto shape_B = cute::make_shape(options.n, options.k, options.l);
+  int const scale_k = (options.k + options.g - 1) / options.g;
+  stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l));
+  stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B);
+  // Reverse stride here due to swap and transpose
+  stride_C =
cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(options.n, options.m, options.l));
+  stride_C_ref = cutlass::make_cute_packed_stride(StrideC_ref{}, cute::make_shape(options.m, options.n, options.l));
+  // Reverse stride here due to swap and transpose
+  stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(options.n, options.m, options.l));
+  stride_D_ref = cutlass::make_cute_packed_stride(StrideD_ref{}, cute::make_shape(options.m, options.n, options.l));
+
+  auto layout_B = make_layout(shape_B, stride_B);
+
+  auto a_coord = cutlass::make_Coord(options.m * options.l, options.k);
+  auto b_coord = cutlass::make_Coord(options.k, options.n * options.l);
+  auto c_coord = cutlass::make_Coord(options.m * options.l, options.n);
+
+  block_A.reset(a_coord.product());
+  block_B.reset(b_coord.product());
+  block_B_modified.reset(b_coord.product());
+  block_B_dq.reset(b_coord.product());
+  block_C.reset(c_coord.product());
+  block_D.reset(c_coord.product());
+  block_ref_D.reset(c_coord.product());
+
+  block_scale.reset(scale_k * options.l * options.n);
+  block_scale_packed.reset(scale_k * options.l * options.n);
+  block_zero.reset(scale_k * options.l * options.n);
+
+  initialize_tensor(block_A, seed + 2022);
+  initialize_quant_tensor(block_B, seed + 2021);
+  unify_quant_encoding(block_B, block_B_modified);
+  initialize_tensor(block_C, seed + 2020);
+  initialize_scale(block_scale, options);
+  initialize_packed_scale(block_scale, block_scale_packed);
+  initialize_zero(block_zero, options);
+
+  auto shape_scale_zero = cute::make_shape(options.n, scale_k, options.l);
+  stride_S = cutlass::make_cute_packed_stride(StrideS{}, cute::make_shape(options.n, scale_k, options.l));
+  stride_S_ref = cutlass::make_cute_packed_stride(StrideS_ref{}, cute::make_shape(options.n, scale_k, options.l));
+  auto layout_scale_zero = make_layout(shape_scale_zero, stride_S_ref);
+
+  dequantize_weight(block_B_dq.get(), block_B.get(), layout_B, block_scale.get(), block_zero.get(), layout_scale_zero, options.g);
+
+#if OPTIMIZE_WEIGHT_LAYOUT
+  // Repeat the reorder layout atom to tile the whole tensor shape
+  layout_B_reordered = tile_to_shape(LayoutAtomQuant{}, shape_B);
+  reorder_tensor(block_B_modified.get(), layout_B, layout_B_reordered);
+
+  print("Quantized tensor layout: ");
+  print(layout_B_reordered);
+  print("\n");
+#endif
+}
+
+/// Populates a Gemm::Arguments structure from the given commandline options
+template <typename Args>
+Args args_from_options(Options const& options)
+{
+  // Swap the A and B tensors, as well as problem shapes, here.
+
+  return Args {
+    cutlass::gemm::GemmUniversalMode::kGemm,
+    {options.n, options.m, options.k, options.l},
+#if OPTIMIZE_WEIGHT_LAYOUT
+    {block_B_modified.get(), layout_B_reordered, block_A.get(), stride_A, block_scale_packed.get(), stride_S, options.g},
+#else
+    {block_B_modified.get(), stride_B, block_A.get(), stride_A, block_scale_packed.get(), stride_S, options.g},
+#endif
+    {{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D}
+  };
+}
+
+bool verify(Options const& options) {
+  //
+  // Compute reference output
+  //
+
+  // In this example, we use the GPU default kernels as a reference (unfused scale).
+  // This avoids numerical differences due to different accumulation order.
+
+  // Again, due to numerical differences, we must use fast acc here when the mma type is
+  // FP8 as the fused implementation only supports fast acc at the moment.
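+  // ("Fast acc" refers to the FP8 FastAccum kernel schedules, which, roughly, keep all accumulation inside the
+  // WGMMA accumulator rather than periodically promoting partial sums into separate registers; choosing the same
+  // policy for the reference keeps its accumulation order comparable to the fused kernel's.)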
+  constexpr bool IsFP8Input = cute::is_same_v<MmaType, cutlass::float_e4m3_t> || cute::is_same_v<MmaType, cutlass::float_e5m2_t>;
+  using FP8Sched = cute::conditional_t<size<0>(TileShape{}) == 64, cutlass::gemm::KernelTmaWarpSpecializedPingpongFP8FastAccum, cutlass::gemm::KernelTmaWarpSpecializedCooperativeFP8FastAccum>;
+  using ScheduleRef = cute::conditional_t<IsFP8Input, FP8Sched, cutlass::gemm::collective::KernelScheduleAuto>;
+
+  using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder<
+      ArchTag, OperatorClass,
+      MmaType, LayoutA, AlignmentA,
+      MmaType, LayoutB, AlignmentB,
+      ElementAccumulator,
+      TileShape, ClusterShape,
+      cutlass::gemm::collective::StageCountAuto,
+      ScheduleRef
+    >::CollectiveOp;
+
+  using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder<
+      cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
+      TileShape, ClusterShape,
+      cutlass::epilogue::collective::EpilogueTileAuto,
+      ElementAccumulator, ElementAccumulator,
+      ElementC, LayoutC, AlignmentC,
+      ElementD, LayoutD, AlignmentD,
+      cutlass::epilogue::NoSmemWarpSpecialized
+    >::CollectiveOp;
+
+  using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal<
+      Shape<int,int,int,int>, // Indicates ProblemShape
+      CollectiveMainloopRef,
+      CollectiveEpilogueRef
+  >;
+
+  using GemmRef = cutlass::gemm::device::GemmUniversalAdapter<GemmKernelRef>;
+
+  typename GemmRef::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGemm,
+    {options.m, options.n, options.k, options.l},
+    {block_A.get(), stride_A, block_B_dq.get(), stride_B},
+    {{options.alpha, options.beta}, block_C.get(), stride_C_ref, block_ref_D.get(), stride_D_ref}
+  };
+
+  // Run the gemm where the scaling is performed outside of the kernel.
+  GemmRef gemm_ref;
+  size_t workspace_size = GemmRef::get_workspace_size(arguments);
+  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
+  CUTLASS_CHECK(gemm_ref.can_implement(arguments));
+  CUTLASS_CHECK(gemm_ref.initialize(arguments, workspace.get()));
+  CUTLASS_CHECK(gemm_ref.run());
+
+  // compare_reference
+  ElementD const epsilon(1e-2f);
+  ElementD const non_zero_floor(1e-4f);
+  bool passed = cutlass::reference::device::BlockCompareRelativelyEqual(block_ref_D.get(), block_D.get(), block_D.size(), epsilon, non_zero_floor);
+
+  return passed;
+}
+
+/// Execute a given example GEMM computation
+template <typename Gemm>
+int run(Options &options)
+{
+  initialize(options);
+
+  // Instantiate CUTLASS kernel depending on templates
+  Gemm gemm;
+
+  // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
+  auto arguments = args_from_options<typename Gemm::Arguments>(options);
+
+  // Using the arguments, query for extra workspace required for matrix multiplication computation
+  size_t workspace_size = Gemm::get_workspace_size(arguments);
+
+  // Allocate workspace memory
+  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
+
+  // Check if the problem size is supported or not
+  CUTLASS_CHECK(gemm.can_implement(arguments));
+
+  // Initialize CUTLASS kernel with arguments and workspace pointer
+  CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
+
+  // Correctness / Warmup iteration
+  CUTLASS_CHECK(gemm.run());
+
+  // Check if output from CUTLASS kernel and reference kernel are equal or not
+  Result result;
+  result.passed = verify(options);
+
+  std::cout << "  Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
+
+  if (!result.passed) {
+    exit(-1);
+  }
+
+  // Run profiling loop
+  if (options.iterations > 0)
+  {
+    GpuTimer timer;
+    timer.start();
+    for (int iter = 0; iter < options.iterations; ++iter) {
+      CUTLASS_CHECK(gemm.run());
+    }
+    timer.stop();
+
+    // Compute average runtime and GFLOPs.
+    float elapsed_ms = timer.elapsed_millis();
+    result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
+    result.gflops = options.gflops(result.avg_runtime_ms / 1000.0);
+
+    std::cout << "  Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.l << std::endl;
+    std::cout << "  Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
+    std::cout << "  GFLOPS: " << result.gflops << std::endl;
+  }
+
+  return 0;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // CUTLASS must be compiled with the CUDA 12.0 Toolkit to run this example
+  // and must have compute capability at least 90.
+  if (__CUDACC_VER_MAJOR__ < 12) {
+    std::cerr << "This example requires CUDA 12 or newer.\n";
+    // Returning zero so this test passes on older Toolkits. Its action is a no-op.
+    return 0;
+  }
+
+  cudaDeviceProp props;
+  int current_device_id;
+  CUDA_CHECK(cudaGetDevice(&current_device_id));
+  CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
+  if (props.major < 9) {
+    std::cerr
+      << "This example requires a GPU of NVIDIA's Hopper Architecture or "
+      << "later (compute capability 90 or greater).\n";
+    return 0;
+  }
+
+  //
+  // Parse options
+  //
+
+  Options options;
+
+  options.parse(argc, args);
+
+  if (options.help) {
+    options.print_usage(std::cout) << std::endl;
+    return 0;
+  }
+
+  //
+  // Evaluate CUTLASS kernels
+  //
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+  if (options.g == options.k) {
+    std::cout << "Running in per-column scale mode." << std::endl;
+  } else {
+    std::cout << "Running in group scale mode." << std::endl;
+  }
+  run<GemmScaleOnly>(options);
+#endif
+
+  return 0;
+}
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu b/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu
index 28baae260c..6a3a8f806f 100644
--- a/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu
+++ b/examples/55_hopper_mixed_dtype_gemm/55_hopper_mixed_dtype_gemm.cu
@@ -53,14 +53,18 @@ equal to the gemm problem K.
 
   Limitations:
-    1) Only supported combinations are 16-bit x {8-bit, 4-bit, 2-bit} and {8-bit} x {4-bit, 2-bit}.
-    2) The narrow type must always be in K-major format.
-    3) The scales and zeros must be MN major. That means if A is scaled, it must be column major, but if B is scaled it must be row major.
-    4) The scales and the zeros must have the same layout and groupsize.
+    1) The narrow type must always be in K-major format.
+    2) The scales and zeros must be MN major. That means if A is scaled, it must be column major, but if B is scaled it must be row major.
+    3) The scales and the zeros must have the same layout and group size.
+    4) The group size must be greater than or equal to the tile shape K.
     5) When dealing with 8-bit x {4-bit, 2-bit}, both inputs must be in K-major format.
     6) Currently, TMA epilogues cannot be used when the narrow type is the B operand. This limitation arises because the implementation always swaps the
        operands to ensure that the narrow type passes through the register file, and TMA epilogues do not currently support implicit swap + transpose operations.
        We plan to address this limitation in the future.
However, we address this in the example by explicitly swapping and transposing the operands.
+
+  Optimizing suggestions:
+    1) Use a small tile size, since register pressure for this GEMM (and RS GEMMs in general) is high.
+    2) Try to avoid using the scale or zero-point modes, because the extra conversion computations will become the bottleneck.
 
   Examples:
 
@@ -94,11 +98,8 @@
 #include "cutlass/util/host_tensor.h"
 #include "cutlass/util/packed_stride.hpp"
 #include "cutlass/util/tensor_view_io.h"
-#include "cutlass/util/reference/host/tensor_fill.h"
-#include "cutlass/util/reference/host/tensor_copy.h"
-#include "cutlass/util/reference/host/tensor_compare.h"
-#include "cutlass/util/reference/host/tensor_norm.h"
-#include "cutlass/util/reference/host/gett.hpp"
+#include "cutlass/util/reference/device/tensor_fill.h"
+#include "cutlass/util/reference/device/tensor_compare.h"
 
 #include "helper.h"
 #include "unfused_weight_dequantize.hpp"
@@ -117,8 +118,8 @@ enum GemmMode {
 /////////////////////////////////////////////////////////////////////////////////////////////////
 /// GEMM kernel configurations
 /////////////////////////////////////////////////////////////////////////////////////////////////
-using MmaType = cutlass::float_e4m3_t;
-using QuantType = cutlass::int4b_t;
+using MmaType = cutlass::half_t;
+using QuantType = cutlass::float_e4m3_t;
 constexpr int TileShapeK = 128 * 8 / sizeof_bits<MmaType>::value;
 
 // A matrix configuration
@@ -154,8 +155,8 @@ using ElementAccumulator = float;  // E
 using ElementCompute = float;  // Element type for epilogue computation
 using ArchTag = cutlass::arch::Sm90;  // Tag indicating the minimum SM that supports the intended feature
 using OperatorClass = cutlass::arch::OpClassTensorOp;  // Operator class tag
-using TileShape = Shape<_128,_256,cute::Int<TileShapeK>>;  // Threadblock-level tile size
-using ClusterShape = Shape<_2,_1,_1>;  // Shape of the threadblocks in a cluster
+using TileShape = Shape<_128,_128,cute::Int<TileShapeK>>;  // Threadblock-level tile size
+using ClusterShape = Shape<_1,_1,_1>;  // Shape of the threadblocks in a cluster
 using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedCooperativeMixedInput;  // Kernel to launch based on the default setting in the Collective Builder
 using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecializedCooperative;
 using EpilogueTileType = cutlass::epilogue::collective::EpilogueTileAuto;
@@ -268,14 +269,14 @@ using StrideS_ref = cutlass::detail::TagToStrideB_t<LayoutScale>;
 StrideS stride_S;
 StrideS_ref stride_S_ref;
 
-cutlass::HostTensor tensor_A;
-cutlass::HostTensor tensor_B;
-cutlass::HostTensor tensor_B_dq;
-cutlass::HostTensor tensor_scale;
-cutlass::HostTensor tensor_zero;
-cutlass::HostTensor tensor_C;
-cutlass::HostTensor tensor_D;
-cutlass::HostTensor tensor_ref_D;
+cutlass::DeviceAllocation<ElementA> block_A;
+cutlass::DeviceAllocation<ElementB> block_B;
+cutlass::DeviceAllocation<MmaType> block_B_dq;
+cutlass::DeviceAllocation<ElementScale> block_scale;
+cutlass::DeviceAllocation<ElementZero> block_zero;
+cutlass::DeviceAllocation<ElementC> block_C;
+cutlass::DeviceAllocation<ElementD> block_D;
+cutlass::DeviceAllocation<ElementD> block_ref_D;
 
 #endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
 
@@ -290,7 +291,7 @@ struct Options {
 
   float alpha = 1.0f;
   float beta = 0.0f;
-  int iterations = 1000;
+  int iterations = 10;
   int mode = 2;
   int m = 5120, n = 4096, k = 4096;
   int g = 128;
@@ -368,9 +369,9 @@ struct Result
 /////////////////////////////////////////////////////////////////////////////////////////////////
 
 /// Helper to initialize a block of device data
-template <typename Element, typename Layout>
+template <typename Element>
 bool initialize_tensor(
-  cutlass::TensorView<Element, Layout> view,
+
cutlass::DeviceAllocation& block, uint64_t seed=2023) { double scope_max, scope_min; @@ -393,34 +394,35 @@ bool initialize_tensor( scope_max = 8; scope_min = -8; } - cutlass::reference::host::TensorFillRandomUniform( - view, seed, scope_max, scope_min); + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); return true; } -template +template bool initialize_quant_tensor( - cutlass::TensorView view, + cutlass::DeviceAllocation& block, uint64_t seed=2023) { float scope_min = float(cutlass::platform::numeric_limits::lowest()); float scope_max = float(cutlass::platform::numeric_limits::max()); - cutlass::reference::host::TensorFillRandomUniform( - view, seed, scope_max, scope_min); + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); return true; } -template +template bool initialize_scale( - cutlass::TensorView view, - const Options &options) { + cutlass::DeviceAllocation& block, + Options const& options) { if (options.mode == GemmMode::ConvertOnly) { // No scales, so just initialize with 1 so we can use the same kernel to dequantize the data. - cutlass::reference::host::TensorFill(view, Element(1.0f)); + std::vector stage(block.size(), Element(1.0f)); + block.copy_from_host(stage.data()); } else { float elt_max_f = float(cutlass::platform::numeric_limits::max()); @@ -430,32 +432,33 @@ bool initialize_scale( float scope_max(max_dequant_val / elt_max_f); float scope_min(min_dequant_val / elt_max_f); - cutlass::reference::host::TensorFillRandomUniform( - view, seed, scope_max, scope_min); + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, Element(scope_max), Element(scope_min)); } return true; } -template +template bool initialize_zero( - cutlass::TensorView view, - const Options &options) { + cutlass::DeviceAllocation& block, + Options const& options) { if (options.mode == GemmMode::ScaleWithZeroPoint) { - cutlass::reference::host::TensorFillRandomUniform( - view, seed, 2.0f, -2.0f); + cutlass::reference::device::BlockFillRandomUniform( + block.get(), block.size(), seed, Element(2.0f), Element(-2.0f)); } else { // No bias, so just initialize with 1 so we can use the same kernel to dequantize the data. 
- cutlass::reference::host::TensorFill(view, Element(0.0f)); + std::vector stage(block.size(), Element(0.0f)); + block.copy_from_host(stage.data()); } return true; } /// Initialize operands to be used in the GEMM and reference GEMM -void initialize(const Options &options) { +void initialize(Options const& options) { auto shape_b = cute::make_shape(options.n, options.k, options.l); - const int scale_k = (options.k + options.g - 1) / options.g; + int const scale_k = (options.k + options.g - 1) / options.g; stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l)); stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_b); // Reverse stride here due to swap and transpose @@ -469,27 +472,21 @@ void initialize(const Options &options) { auto b_coord = cutlass::make_Coord(options.k, options.n * options.l); auto c_coord = cutlass::make_Coord(options.m * options.l, options.n); - tensor_A.resize(a_coord); - tensor_B.resize(b_coord); - tensor_B_dq.resize(b_coord); - tensor_C.resize(c_coord); - tensor_D.resize(c_coord); - tensor_ref_D.resize(c_coord); - - tensor_scale.resize({scale_k * options.l, options.n}); - tensor_zero.resize({scale_k * options.l, options.n}); + block_A.reset(a_coord.product()); + block_B.reset(b_coord.product()); + block_B_dq.reset(b_coord.product()); + block_C.reset(c_coord.product()); + block_D.reset(c_coord.product()); + block_ref_D.reset(c_coord.product()); - initialize_tensor(tensor_A.host_view(), seed + 2022); - initialize_quant_tensor(tensor_B.host_view(), seed + 2021); - initialize_tensor(tensor_C.host_view(), seed + 2020); - initialize_scale(tensor_scale.host_view(), options); - initialize_zero(tensor_zero.host_view(), options); + block_scale.reset(scale_k * options.l * options.n); + block_zero.reset(scale_k * options.l * options.n); - tensor_A.sync_device(); - tensor_B.sync_device(); - tensor_C.sync_device(); - tensor_scale.sync_device(); - tensor_zero.sync_device(); + initialize_tensor(block_A, seed + 2022); + initialize_quant_tensor(block_B, seed + 2021); + initialize_tensor(block_C, seed + 2020); + initialize_scale(block_scale, options); + initialize_zero(block_zero, options); auto layout_B = make_layout(shape_b, stride_B); @@ -498,37 +495,36 @@ void initialize(const Options &options) { stride_S_ref = cutlass::make_cute_packed_stride(StrideS_ref{}, cute::make_shape(options.n, scale_k, options.l)); auto layout_scale_zero = make_layout(shape_scale_zero, stride_S_ref); - dequantize_weight(tensor_B_dq.device_data(), tensor_B.device_data(), layout_B, tensor_scale.device_data(), tensor_zero.device_data(), layout_scale_zero, options.g); - tensor_B_dq.sync_host(); + dequantize_weight(block_B_dq.get(), block_B.get(), layout_B, block_scale.get(), block_zero.get(), layout_scale_zero, options.g); } /// Populates a Gemm::Arguments structure from the given commandline options template -Args args_from_options(const Options &options) +Args args_from_options(Options const& options) { // Swap the A and B tensors, as well as problem shapes here. 
if (options.mode == GemmMode::ConvertOnly) { return Args { cutlass::gemm::GemmUniversalMode::kGemm, {options.n, options.m, options.k, options.l}, - {tensor_B.device_data(), stride_B, tensor_A.device_data(), stride_A}, - {{options.alpha, options.beta}, tensor_C.device_data(), stride_C, tensor_D.device_data(), stride_D} + {block_B.get(), stride_B, block_A.get(), stride_A}, + {{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D} }; } else if (options.mode == GemmMode::ScaleOnly) { return Args { cutlass::gemm::GemmUniversalMode::kGemm, {options.n, options.m, options.k, options.l}, - {tensor_B.device_data(), stride_B, tensor_A.device_data(), stride_A, tensor_scale.device_data(), stride_S, options.g}, - {{options.alpha, options.beta}, tensor_C.device_data(), stride_C, tensor_D.device_data(), stride_D} + {block_B.get(), stride_B, block_A.get(), stride_A, block_scale.get(), stride_S, options.g}, + {{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D} }; } else if (options.mode == GemmMode::ScaleWithZeroPoint) { return Args { cutlass::gemm::GemmUniversalMode::kGemm, {options.n, options.m, options.k, options.l}, - {tensor_B.device_data(), stride_B, tensor_A.device_data(), stride_A, tensor_scale.device_data(), stride_S, options.g, tensor_zero.device_data()}, - {{options.alpha, options.beta}, tensor_C.device_data(), stride_C, tensor_D.device_data(), stride_D} + {block_B.get(), stride_B, block_A.get(), stride_A, block_scale.get(), stride_S, options.g, block_zero.get()}, + {{options.alpha, options.beta}, block_C.get(), stride_C, block_D.get(), stride_D} }; } else { std::cerr << "Invalid mode " << options.mode << ". Must be 0, 1 or 2." << std::endl; @@ -542,7 +538,7 @@ bool verify(const Options &options) { // // In this example, we use the GPU default kernels as a reference (unfused scale) - // This is to avoid numerical differences from different accumulation order. + // This avoids numerical differences due to different accumulation order. // Again, due to numerical differences, we must use fast acc here when the mma type is // FP8 as the fused implementation only supports fast acc at the moment. @@ -581,8 +577,8 @@ bool verify(const Options &options) { typename GemmRef::Arguments arguments{ cutlass::gemm::GemmUniversalMode::kGemm, {options.m, options.n, options.k, options.l}, - {tensor_A.device_data(), stride_A, tensor_B_dq.device_data(), stride_B}, - {{options.alpha, options.beta}, tensor_C.device_data(), stride_C_ref, tensor_ref_D.device_data(), stride_D_ref} + {block_A.get(), stride_A, block_B_dq.get(), stride_B}, + {{options.alpha, options.beta}, block_C.get(), stride_C_ref, block_ref_D.get(), stride_D_ref} }; // Run the gemm where the scaling is performed outside of the kernel. 
@@ -594,11 +590,9 @@ bool verify(const Options &options) { CUTLASS_CHECK(gemm_ref.run()); // compare_reference - tensor_D.sync_host(); - tensor_ref_D.sync_host(); - const ElementD epsilon(1e-2f); - const ElementD non_zero_floor(1e-4f); - bool passed = cutlass::reference::host::TensorRelativelyEquals(tensor_ref_D.host_view(), tensor_D.host_view(), epsilon, non_zero_floor); + ElementD const epsilon(1e-2f); + ElementD const non_zero_floor(1e-4f); + bool passed = cutlass::reference::device::BlockCompareRelativelyEqual(block_ref_D.get(), block_D.get(), block_D.size(), epsilon, non_zero_floor); return passed; } @@ -730,4 +724,4 @@ int main(int argc, char const **args) { return 0; } -///////////////////////////////////////////////////////////////////////////////////////////////// +///////////////////////////////////////////////////////////////////////////////////////////////// \ No newline at end of file diff --git a/examples/55_hopper_mixed_dtype_gemm/CMakeLists.txt b/examples/55_hopper_mixed_dtype_gemm/CMakeLists.txt index 5ddfbd2e6e..23dca4f3fd 100644 --- a/examples/55_hopper_mixed_dtype_gemm/CMakeLists.txt +++ b/examples/55_hopper_mixed_dtype_gemm/CMakeLists.txt @@ -55,5 +55,27 @@ cutlass_example_add_executable( TEST_SCALE_ZERO_GROUPED TEST_SCALE_RESIDUE TEST_SCALE_ZERO_RESIDUE - TEST_ALPHA_BETA + # TEST_ALPHA_BETA ) + +cutlass_example_add_executable( + 55_hopper_int4_fp8_gemm + 55_hopper_int4_fp8_gemm.cu + TEST_COMMAND_OPTIONS + TEST_DIRECT_BATCHED + TEST_SCALE_PERCOL + TEST_SCALE_GROUP + TEST_SCALE_RESIDUE + # TEST_ALPHA_BETA + ) + + cutlass_example_add_executable( + 55_hopper_int4_bf16_gemm + 55_hopper_int4_bf16_gemm.cu + TEST_COMMAND_OPTIONS + TEST_DIRECT_BATCHED + TEST_SCALE_PERCOL + TEST_SCALE_GROUP + TEST_SCALE_RESIDUE + # TEST_ALPHA_BETA + ) diff --git a/examples/55_hopper_mixed_dtype_gemm/README.md b/examples/55_hopper_mixed_dtype_gemm/README.md index 8c393a6b75..07265f0d7e 100644 --- a/examples/55_hopper_mixed_dtype_gemm/README.md +++ b/examples/55_hopper_mixed_dtype_gemm/README.md @@ -11,6 +11,8 @@ This first version only supports mixed type GEMMs using TMA. While the example offers a harness for straightforward benchmarking, this initial implementation isn't optimized for performance in the majority of scenarios. We expect this implementation to be performant for `{fp16, bf16} x {int8, int4}` and `{fp8} x {int4}` for problems that are compute bound. Additionally, we expect good performance for `fp16, bf16` or `fp32` scales and zero-points. For best performance, it is ideal to have the scales and zero-points be the same type. +The scale only mode for `fp8 x int4` is significantly slower than direct conversion mode. There is a lookup-table workaround targeting this mode, as shown in `55_hopper_int4_fp8_gemm.cu`. To use this feature, use `cutlass::Array` as the scale type in the collective builder. However, it requires modifications to the encoding of quantized weights and scale factors. Also, scale with zero point mode is not supported for now. + We are currently optimizing the following cases: 1. Memory bound cases for all types diff --git a/examples/55_hopper_mixed_dtype_gemm/packed_scale.hpp b/examples/55_hopper_mixed_dtype_gemm/packed_scale.hpp new file mode 100644 index 0000000000..7d732dcda7 --- /dev/null +++ b/examples/55_hopper_mixed_dtype_gemm/packed_scale.hpp @@ -0,0 +1,131 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+ * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +#pragma once + +#include + +#include "cutlass/float8.h" + +namespace cutlass +{ +template +class packed_scale_t { +public: + static_assert(cute::is_same_v || + cute::is_same_v || + cute::is_same_v || + cute::is_same_v, + "only 8 bit arithmetic types are supported."); + CUTLASS_HOST_DEVICE + explicit packed_scale_t(T val) { + if constexpr (!cute::is_unsigned_v) { + // Only pack negative values. The positive values are generated in flight in the mainloop. 
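+      // Packed byte layout produced below (lowest byte first):
+      //   storage[0] = { -8*val, -7*val, -6*val, -5*val }
+      //   storage[1] = { -4*val, -3*val, -2*val, -1*val }
+      // A unified-encoding 4-bit weight can then select its partial product with a single PRMT byte lookup.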
+ storage[0] = pack4(T(float(val) * -8.f), T(float(val) * -7.f), T(float(val) * -6.f), T(float(val) * -5.f)); + storage[1] = pack4(T(float(val) * -4.f), T(float(val) * -3.f), T(float(val) * -2.f), -val); + } + else { + storage[0] = pack4(T(float(val) * 8.f), T(float(val) * 7.f), T(float(val) * 6.f), T(float(val) * 5.f)); + storage[1] = pack4(T(float(val) * 4.f), T(float(val) * 3.f), T(float(val) * 2.f), val); + } + } + CUTLASS_HOST_DEVICE + packed_scale_t() = default; + CUTLASS_HOST_DEVICE + explicit operator float() const { + return float(get()); + } + CUTLASS_HOST_DEVICE + bool operator==(packed_scale_t const& rhs) const { + return storage[0] == rhs.storage[0] && storage[1] == rhs.storage[1]; + } + CUTLASS_HOST_DEVICE + bool operator!=(packed_scale_t const& rhs) const { + return !(*this == rhs); + } + CUTLASS_HOST_DEVICE + friend packed_scale_t operator+(packed_scale_t const& lhs, packed_scale_t const& rhs) { + return packed_scale_t(lhs.get() + rhs.get()); + } + CUTLASS_HOST_DEVICE + friend packed_scale_t operator-(packed_scale_t const& lhs, packed_scale_t const& rhs) { + return packed_scale_t(lhs.get() - rhs.get()); + } + CUTLASS_HOST_DEVICE + friend packed_scale_t operator*(packed_scale_t const& lhs, packed_scale_t const& rhs) { + return packed_scale_t(lhs.get() * rhs.get()); + } + CUTLASS_HOST_DEVICE + friend packed_scale_t operator/(packed_scale_t const& lhs, packed_scale_t const& rhs) { + return packed_scale_t(lhs.get() / rhs.get()); + } + +private: + using Storage = uint32_t; + using Stage = uint8_t; + + Storage storage[2] {}; + + CUTLASS_HOST_DEVICE + static Storage pack4(T c1, T c2, T c3, T c4) { + Storage result = 0; + result |= (static_cast(reinterpret_cast(c4)) << 24); + result |= (static_cast(reinterpret_cast(c3)) << 16); + result |= (static_cast(reinterpret_cast(c2)) << 8); + result |= static_cast(reinterpret_cast(c1)); + return result; + } + CUTLASS_HOST_DEVICE + T get() const { + auto stage = static_cast(storage[0] >> 8); + #if defined(__CUDA_ARCH__) + return reinterpret_cast(stage); + #else + T tmp; + std::memcpy(&tmp, &stage, sizeof(Stage)); + return tmp; + #endif + } + CUTLASS_HOST_DEVICE + T get(int idx) const { + Stage stage; + if (idx < 4) stage = static_cast(storage[0] >> (8 * idx)); + else stage = static_cast(storage[1] >> (8 * idx - 32)); + #if defined(__CUDA_ARCH__) + return reinterpret_cast(stage); + #else + T tmp; + std::memcpy(&tmp, &stage, sizeof(Stage)); + return tmp; + #endif + } +}; +} diff --git a/examples/55_hopper_mixed_dtype_gemm/reorder_utils.hpp b/examples/55_hopper_mixed_dtype_gemm/reorder_utils.hpp new file mode 100644 index 0000000000..2be425514a --- /dev/null +++ b/examples/55_hopper_mixed_dtype_gemm/reorder_utils.hpp @@ -0,0 +1,122 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. 
Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +#include "cute/layout.hpp" +#include "cute/tensor.hpp" + +#include "cutlass/util/device_memory.h" + +// Given a type of MMA instruction, compute a memory reordering atom that places all values +// owned by each thread in contiguous memory locations. This improves smem load vectorization, +// particularly for mixed dtype GEMMs where a narrow type is loaded in the thread/value order +// of the wider type and may result in inefficient sub-bank (8-bit or 16-bit) accesses. +template +auto compute_memory_reordering_atom() +{ + using namespace cute; + + // 1. Choose an MMA atom to access TV layout and MN shape + // Note: parameters like GMMA Major, TileShape, ElementC don't affect TV layout of A, use arbitrary + using MmaAtom = decltype(SM90::GMMA::rs_op_selector>()); + using MmaTraits = MMA_Traits; + auto shape_MK = select<0,2>(typename MmaTraits::Shape_MNK{}); + auto tv_layout_mma = typename MmaTraits::ALayout{}; + + // 2. Create a single warp's TV layout from that of the whole MMA + // Note: this assumes A is partitioned between warps along M mode + auto tile_TV_warp = make_shape(Int<32>{}, size<1>(tv_layout_mma)); + auto tv_layout_mma_warp = make_layout_like(composition(tv_layout_mma, tile_TV_warp)); + + // 3. Invert warp's TV layout to get MK layout (m,k -> thr,val) + auto shape_MK_warp = shape_div(shape_MK, size(typename MmaTraits::ThrID{}) / Int<32>{}); + auto mk_layout_mma_warp = right_inverse(tv_layout_mma_warp).with_shape(shape_MK_warp); + + // 4. 
+  auto tv_to_offset = make_ordered_layout(shape(tv_layout_mma_warp), Step<_1,_0>{});
+  auto layout_atom = composition(tv_to_offset, mk_layout_mma_warp);
+
+  return layout_atom;
+}
+
+template <class EngineSrc, class LayoutSrc, class EngineDst, class LayoutDst>
+__global__ void reorder_tensor_kernel(
+  cute::Tensor<EngineSrc, LayoutSrc> src,
+  cute::Tensor<EngineDst, LayoutDst> dst)
+{
+  auto i = blockIdx.x;
+  auto k = blockIdx.y;
+  for (int j = threadIdx.x; j < cute::size<1>(src); j += blockDim.x) {
+    dst(i,j,k) = src(i,j,k);
+  }
+}
+
+template <class EngineSrc, class LayoutSrc, class EngineDst, class LayoutDst>
+void reorder_tensor(
+  cute::Tensor<EngineSrc, LayoutSrc> t_src,
+  cute::Tensor<EngineDst, LayoutDst> t_dst)
+{
+  using T = typename EngineDst::value_type;
+  static_assert(cute::is_same_v<cute::remove_const_t<typename EngineSrc::value_type>, T>, "Type mismatch");
+  using V = cute::uint_bit_t<cute::max(8, cute::sizeof_bits_v<T>)>;
+
+  cute::Tensor v_src = cute::recast<V>(t_src);
+  cute::Tensor v_dst = cute::recast<V>(t_dst);
+
+  int threads = 256;
+  dim3 blocks{unsigned(cute::size<0>(v_src)), unsigned(cute::size<2>(v_src)), 1u};
+
+  reorder_tensor_kernel<<<blocks, threads>>>(v_src, v_dst);
+  CUDA_CHECK(cudaDeviceSynchronize());
+}
+
+// Pointer-based interface (out of place)
+template <class T, class LayoutSrc, class LayoutDst>
+void reorder_tensor(
+  T const* src,
+  LayoutSrc const& layout_src,
+  T * dst,
+  LayoutDst const& layout_dst)
+{
+  reorder_tensor(make_tensor(src, layout_src),
+                 make_tensor(dst, layout_dst));
+}
+
+// In-place version
+template <class T, class LayoutSrc, class LayoutDst>
+void reorder_tensor(
+  T * data,
+  LayoutSrc const& layout_src,
+  LayoutDst const& layout_dst)
+{
+  cutlass::DeviceAllocation<T> temp(cute::size(layout_src));
+  reorder_tensor(data, layout_src, temp.get(), layout_dst);
+  cutlass::device_memory::copy_device_to_device(data, temp.get(), static_cast<size_t>(cute::size(layout_src)));
+}
\ No newline at end of file
diff --git a/examples/56_hopper_ptr_array_batched_gemm/56_hopper_ptr_array_batched_gemm.cu b/examples/56_hopper_ptr_array_batched_gemm/56_hopper_ptr_array_batched_gemm.cu
index 7a191ce2d8..51ce970dbd 100644
--- a/examples/56_hopper_ptr_array_batched_gemm/56_hopper_ptr_array_batched_gemm.cu
+++ b/examples/56_hopper_ptr_array_batched_gemm/56_hopper_ptr_array_batched_gemm.cu
@@ -32,7 +32,7 @@
 /*! \file
     \brief Hopper Ptr-Array Batched GEMM example using CUTLASS 3 APIs for NVIDIA Hopper architecture.
 
-    This example demonstrates an implementation of Ptr-Array Batched GEMM using a TMA + GMMA
+    This example demonstrates an implementation of Ptr-Array Batched GEMM using a TMA + GMMA
     warp-specialized cooperative kernel. The new feature showcased in this example is
     on-the-fly modification of TMA descriptors to move between batches (represented by l).
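// Usage sketch for the reorder utilities above. `MmaType`, the extents, and the
// K-major source layout are illustrative assumptions (example 55 drives
// reorder_tensor() with its real kernel types); includes come via reorder_utils.hpp.
void reorder_b_operand(cutlass::half_t const* src, cutlass::half_t* dst,
                       int N, int K, int L) {
  using namespace cute;
  using MmaType = cutlass::half_t;   // assumed wide (MMA) element type
  // Atom that places each warp's values in contiguous memory
  auto layout_atom = compute_memory_reordering_atom<MmaType>();
  // Canonical K-major (N,K,L) source layout, and the atom tiled to the full shape
  auto layout_src = make_layout(make_shape(N, K, L), make_stride(K, Int<1>{}, N * K));
  auto layout_dst = tile_to_shape(layout_atom, make_shape(N, K, L));
  reorder_tensor(src, layout_src, dst, layout_dst);  // launches the shuffle kernel
}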
@@ -95,40 +95,66 @@ constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value;  // M
 using ElementAccumulator = float;                                  // Element type for internal accumulation
 using ArchTag = cutlass::arch::Sm90;                               // Tag indicating the minimum SM that supports the intended feature
 using OperatorClass = cutlass::arch::OpClassTensorOp;              // Operator class tag
-using TileShape = Shape<_256,_128,_64>;                            // Threadblock-level tile size
-using ClusterShape = Shape<_1,_2,_1>;                              // Shape of the threadblocks in a cluster
 using StageCountType = cutlass::gemm::collective::StageCountAuto;  // Stage count maximized based on the tile size
-using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperative;  // Kernel to launch
-using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative;  // Epilogue to launch
-
-using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
-    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
-    TileShape, ClusterShape,
-    cutlass::epilogue::collective::EpilogueTileAuto,
-    ElementAccumulator, ElementAccumulator,
-    ElementC, LayoutC, AlignmentC,
-    ElementC, LayoutC, AlignmentC,
-    EpilogueSchedule
-  >::CollectiveOp;
-
-using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
-    ArchTag, OperatorClass,
-    ElementA, LayoutA, AlignmentA,
-    ElementB, LayoutB, AlignmentB,
-    ElementAccumulator,
-    TileShape, ClusterShape,
-    cutlass::gemm::collective::StageCountAutoCarveout<
-      static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
-    KernelSchedule
-  >::CollectiveOp;
-
-using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
-    cutlass::gemm::ArrayProblemShape<Shape<int,int,int,int>>,
-    CollectiveMainloop,
-    CollectiveEpilogue
->;
-
-using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
+
+// Different configs for pingpong/cooperative
+struct CooperativeConfig {
+  using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperative;
+  using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative;
+  using TileShape = Shape<_256,_128,_64>;
+  using ClusterShape = Shape<_1,_2,_1>;
+};
+
+struct PingpongConfig {
+  using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpong;
+  using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong;
+  using TileShape = Shape<_64,_128,_64>;
+  using ClusterShape = Shape<_1,_1,_1>;
+};
+
+template <typename ScheduleConfig>
+struct GemmGivenSchedule {
+  using TileShape = typename ScheduleConfig::TileShape;                // Threadblock-level tile size
+  using ClusterShape = typename ScheduleConfig::ClusterShape;          // Shape of the threadblocks in a cluster
+  using KernelSchedule = typename ScheduleConfig::KernelSchedule;      // Kernel to launch
+  using EpilogueSchedule = typename ScheduleConfig::EpilogueSchedule;  // Epilogue to launch
+
+  using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
+      cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
+      TileShape, ClusterShape,
+      cutlass::epilogue::collective::EpilogueTileAuto,
+      ElementAccumulator, ElementAccumulator,
+      ElementC, LayoutC, AlignmentC,
+      ElementC, LayoutC, AlignmentC,
+      EpilogueSchedule
+    >::CollectiveOp;
+
+  using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
+      ArchTag, OperatorClass,
+      ElementA, LayoutA, AlignmentA,
+      ElementB, LayoutB, AlignmentB,
+      ElementAccumulator,
+      TileShape, ClusterShape,
+      cutlass::gemm::collective::StageCountAutoCarveout<
+        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
+      KernelSchedule
+    >::CollectiveOp;
+
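+  // StageCountAutoCarveout subtracts the epilogue's SharedStorage from the smem budget
+  // before sizing the mainloop pipeline: stages ~= (smem_capacity - epilogue_bytes) / stage_bytes.
+  // Illustrative numbers (assumed, not the builder's exact accounting): ~228 KB of Hopper
+  // smem, a 4 KB epilogue, and a 256x128x64 half_t tile at (256*64 + 128*64) * 2 B = 48 KB
+  // per A+B stage leave roughly 4 mainloop stages.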
+  using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
+      cutlass::gemm::ArrayProblemShape<Shape<int,int,int,int>>,
+      CollectiveMainloop,
+      CollectiveEpilogue
+    >;
+
+  using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
+};
+
+using GemmKernel = GemmGivenSchedule<CooperativeConfig>::GemmKernel;
+using Gemm = GemmGivenSchedule<CooperativeConfig>::Gemm;
+
+using GemmKernelPingpong = GemmGivenSchedule<PingpongConfig>::GemmKernel;
+using GemmPingpong = GemmGivenSchedule<PingpongConfig>::Gemm;
+
 
 // Reference device GEMM implementation type
 using DeviceGemmReference = cutlass::reference::device::Gemm<
@@ -261,14 +287,14 @@ bool initialize_block(
   int bits_input = cutlass::sizeof_bits<Element>::value;
 
   if (bits_input == 1) {
-    scope_max = 2;
-    scope_min = 0;
+    scope_max = static_cast<Element>(2);
+    scope_min = static_cast<Element>(0);
   } else if (bits_input <= 8) {
-    scope_max = 2;
-    scope_min = -2;
+    scope_max = static_cast<Element>(2);
+    scope_min = static_cast<Element>(-2);
   } else {
-    scope_max = 8;
-    scope_min = -8;
+    scope_max = static_cast<Element>(8);
+    scope_min = static_cast<Element>(-8);
   }
 
   cutlass::reference::device::BlockFillRandomUniform(
@@ -351,7 +377,8 @@ void initialize(const Options &options) {
 }
 
 /// Populates a Gemm::Arguments structure from the given commandline options
-typename Gemm::Arguments args_from_options(const Options &options)
+template <typename GemmT>
+typename GemmT::Arguments args_from_options(const Options &options)
 {
   cutlass::KernelHardwareInfo hw_info;
   // Change device_id to another value if you are running on a machine with multiple GPUs and wish
@@ -359,7 +386,7 @@ typename Gemm::Arguments args_from_options(const Options &options)
   hw_info.device_id = 0;
   hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id);
 
-  typename Gemm::Arguments arguments{
+  typename GemmT::Arguments arguments{
     cutlass::gemm::GemmUniversalMode::kArray,
     {{options.m, options.n, options.k, options.l}},
     {ptr_A.get(), stride_A, ptr_B.get(), stride_B},
@@ -405,20 +432,20 @@ bool verify(const Options &options) {
 }
 
 /// Execute a given example GEMM computation
-template <typename Gemm>
+template <typename GemmT>
 int run(Options &options)
 {
   allocate(options);
   initialize(options);
 
   // Instantiate CUTLASS kernel depending on templates
-  Gemm gemm;
+  GemmT gemm;
 
   // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
-  auto arguments = args_from_options(options);
+  auto arguments = args_from_options<GemmT>(options);
 
   // Using the arguments, query for extra workspace required for matrix multiplication computation
-  size_t workspace_size = Gemm::get_workspace_size(arguments);
+  size_t workspace_size = GemmT::get_workspace_size(arguments);
 
   // Allocate workspace memory
   cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
@@ -510,10 +537,14 @@ int main(int argc, char const **args) {
   //
 
 #if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED)
+  std::cout << "\n*** Cooperative schedule ***" << std::endl;
   run<Gemm>(options);
+  std::cout << "\n*** Pingpong schedule ***" << std::endl;
+  run<GemmPingpong>(options);
 #endif
 
   return 0;
 }
 
 /////////////////////////////////////////////////////////////////////////////////////////////////
+
diff --git a/examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm.cu b/examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm.cu
index f94679568a..d57e1deea5 100644
--- a/examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm.cu
+++ b/examples/57_hopper_grouped_gemm/57_hopper_grouped_gemm.cu
@@ -91,9 +91,9 @@ using namespace cute;
 
 using ProblemShape = cutlass::gemm::GroupProblemShape<Shape<int,int,int>>;  // <M,N,K> per group
 
-using ElementA = cutlass::float_e4m3_t;                                     // Element type for A matrix operand
-using ElementB = cutlass::float_e5m2_t;                                     // 
Element type for B matrix operand -using ElementC = cutlass::half_t; // Element type for C and D matrix operands +using ElementA = cutlass::float_e4m3_t; // Element type for A matrix operand +using ElementB = cutlass::float_e5m2_t; // Element type for B matrix operand +using ElementC = cutlass::half_t; // Element type for C and D matrix operands #if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED) @@ -117,20 +117,39 @@ constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // A using ElementAccumulator = float; // Element type for internal accumulation using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag -using TileShape = Shape<_256,_128,_128>; // Threadblock-level tile size -using ClusterShape = Shape<_2,_2,_1>; // Shape of the threadblocks in a cluster using StageCountType = cutlass::gemm::collective::StageCountAuto; // Stage count maximized based on the tile size -using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperativeFP8FastAccum; // Kernel to launch -using EpilogueSchedule = cutlass::epilogue::PtrArrayNoSmemWarpSpecialized; // Epilogue to launch -using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< +// Different configs for pingpong/cooperative +struct CooperativeConfig { + using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedCooperativeFP8FastAccum; + using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedCooperative; + using TileShape = Shape<_256,_128,_128>; + using ClusterShape = Shape<_2,_2,_1>; +}; + +struct PingpongConfig { + using KernelSchedule = cutlass::gemm::KernelPtrArrayTmaWarpSpecializedPingpongFP8FastAccum; + using EpilogueSchedule = cutlass::epilogue::PtrArrayTmaWarpSpecializedPingpong; + using TileShape = Shape<_128,_128,_128>; + using ClusterShape = Shape<_2,_1,_1>; +}; + +template +struct GemmGivenSchedule { + using TileShape = typename ScheduleConfig::TileShape; // Threadblock-level tile size + using ClusterShape = typename ScheduleConfig::ClusterShape; // Shape of the threadblocks in a cluster + using KernelSchedule = typename ScheduleConfig::KernelSchedule; // Kernel to launch + using EpilogueSchedule = typename ScheduleConfig::EpilogueSchedule; // Epilogue to launch + + using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, TileShape, ClusterShape, cutlass::epilogue::collective::EpilogueTileAuto, ElementAccumulator, ElementAccumulator, ElementC, LayoutC *, AlignmentC, ElementC, LayoutC *, AlignmentC, - EpilogueSchedule + EpilogueSchedule, + cutlass::epilogue::fusion::LinearCombination >::CollectiveOp; using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< @@ -144,13 +163,20 @@ using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder KernelSchedule >::CollectiveOp; -using GemmKernel = cutlass::gemm::kernel::GemmUniversal< - ProblemShape, - CollectiveMainloop, - CollectiveEpilogue ->; + using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloop, + CollectiveEpilogue + >; -using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + using Gemm = cutlass::gemm::device::GemmUniversalAdapter; +}; + +using GemmKernel = GemmGivenSchedule::GemmKernel; +using Gemm = GemmGivenSchedule::Gemm; + +using GemmKernelPingpong = GemmGivenSchedule::GemmKernel; +using GemmPingpong = 
GemmGivenSchedule::Gemm; // Reference device GEMM implementation type using DeviceGemmReference = cutlass::reference::device::Gemm< @@ -271,10 +297,10 @@ struct Options { int n = cmd_line_n; int k = cmd_line_k; if (m < 1) { - m = ((rand() % 512) + 1); + m = alignment * ((rand() % 64) + 1); } if (n < 1) { - n = ((rand() % 512) + 1); + n = alignment * ((rand() % 64) + 1); } if (k < 1) { k = alignment * ((rand() % 64) + 1); @@ -521,7 +547,8 @@ void initialize(const Options &options) { } /// Populates a Gemm::Arguments structure from the given commandline options -typename Gemm::Arguments args_from_options(const Options &options, bool host_problem_shapes_available = true) +template +typename GemmT::Arguments args_from_options(const Options &options, bool host_problem_shapes_available = true) { cutlass::KernelHardwareInfo hw_info; // Change device_id to another value if you are running on a machine with multiple GPUs and wish @@ -529,33 +556,49 @@ typename Gemm::Arguments args_from_options(const Options &options, bool host_pro hw_info.device_id = 0; hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); - typename Gemm::EpilogueOutputOp::Params params; + typename GemmT::Arguments arguments; + decltype(arguments.epilogue.thread) fusion_args; + if (options.alpha != FLT_MAX && options.beta != FLT_MAX) { // If both alpha/beta are provided (via cmd line args) and are scalar, i.e., same alpha/beta applies to all batches. - params = typename Gemm::EpilogueOutputOp::Params( - ElementAccumulator(options.alpha), ElementAccumulator(options.beta)); + fusion_args.alpha = options.alpha; + fusion_args.beta = options.beta; + fusion_args.alpha_ptr = nullptr; + fusion_args.beta_ptr = nullptr; + fusion_args.alpha_ptr_array = nullptr; + fusion_args.beta_ptr_array = nullptr; + // Single alpha and beta for all groups + fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 0}; + fusion_args.dBeta = {cute::_0{}, cute::_0{}, 0}; } else { // If pointers to alpha/beta are provided, i.e., alpha/beta can differ between batches/groups. 
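 // (dAlpha/dBeta above and below are cute strides over (M,N,L): {_0,_0,0} broadcasts a
 // single scalar to every group, while {_0,_0,1} steps the alpha/beta pointer arrays
 // once per group index l, which is how one kernel applies per-group epilogue scalars.)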
- params = typename Gemm::EpilogueOutputOp::Params(alpha_device.get(), beta_device.get()); + fusion_args.alpha = 0; + fusion_args.beta = 0; + fusion_args.alpha_ptr = nullptr; + fusion_args.beta_ptr = nullptr; + fusion_args.alpha_ptr_array = alpha_device.get(); + fusion_args.beta_ptr_array = beta_device.get(); + // One alpha and beta per each group + fusion_args.dAlpha = {cute::_0{}, cute::_0{}, 1}; + fusion_args.dBeta = {cute::_0{}, cute::_0{}, 1}; } - typename Gemm::Arguments arguments; if (host_problem_shapes_available) { - arguments = typename Gemm::Arguments { + arguments = typename GemmT::Arguments { cutlass::gemm::GemmUniversalMode::kGrouped, {options.groups, problem_sizes.get(), options.problem_sizes_host.data()}, {ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get()}, - {params, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()}, + {fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()}, hw_info }; } else { - arguments = typename Gemm::Arguments { + arguments = typename GemmT::Arguments { cutlass::gemm::GemmUniversalMode::kGrouped, {options.groups, problem_sizes.get(), nullptr}, {ptr_A.get(), stride_A.get(), ptr_B.get(), stride_B.get()}, - {params, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()}, + {fusion_args, ptr_C.get(), stride_C.get(), ptr_D.get(), stride_D.get()}, hw_info }; } @@ -605,20 +648,20 @@ bool verify(const Options &options) { } /// Execute a given example GEMM computation -template +template int run(Options &options, bool host_problem_shapes_available = true) { allocate(options); initialize(options); // Instantiate CUTLASS kernel depending on templates - Gemm gemm; + GemmT gemm; // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm - auto arguments = args_from_options(options, host_problem_shapes_available); + auto arguments = args_from_options(options, host_problem_shapes_available); // Using the arguments, query for extra workspace required for matrix multiplication computation - size_t workspace_size = Gemm::get_workspace_size(arguments); + size_t workspace_size = GemmT::get_workspace_size(arguments); // Allocate workspace memory cutlass::device_memory::allocation workspace(workspace_size); @@ -713,8 +756,14 @@ int main(int argc, char const **args) { // #if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED) + std::cout << "\n*** Cooperative schedule ***" << std::endl; run(options); + std::cout << "\n*** Cooperative schedule (host problem shapes unavailable) ***" << std::endl; run(options, false /*host_problem_shapes_available*/); + std::cout << "\n*** Pingpong schedule ***" << std::endl; + run(options); + std::cout << "\n*** Pingpong schedule (host problem shapes unavailable) ***" << std::endl; + run(options, false /*host_problem_shapes_available*/); #endif return 0; diff --git a/examples/57_hopper_grouped_gemm/CMakeLists.txt b/examples/57_hopper_grouped_gemm/CMakeLists.txt index 2c3ff3a496..1dadbfa813 100644 --- a/examples/57_hopper_grouped_gemm/CMakeLists.txt +++ b/examples/57_hopper_grouped_gemm/CMakeLists.txt @@ -32,10 +32,10 @@ set(TEST_RANDOM --iterations=0) # Random problem sizes set(TEST_RANDOM_LARGE_GROUP --groups=500 --iterations=0) # Random problem sizes -set(TEST_EPILOGUE --alpha=0.5 --beta=0.7 --iterations=0) # Random problem sizes +set(TEST_EPILOGUE --alpha=0.5 --beta=0.5 --iterations=0) # Random problem sizes set(TEST_EPILOGUE_LARGE_GROUP --alpha=1.5 --beta=2.0 --groups=500 --iterations=0) # Random problem sizes -set(TEST_EPILOGUE_OP --beta=0.7 --iterations=1) # Random 
problem sizes +set(TEST_EPILOGUE_OP --beta=0.5 --iterations=1) # Random problem sizes set(TEST_EPILOGUE_OP_LARGE_GROUP --alpha=1.5 --iterations=1) # Random problem sizes set(TEST_FIXED --m=2048 --n=5120 --k=8192 --groups=50 --iterations=0) # Fixed problem sizes diff --git a/examples/59_ampere_gather_scatter_conv/CMakeLists.txt b/examples/59_ampere_gather_scatter_conv/CMakeLists.txt index e7f164003d..ce22cd1f37 100644 --- a/examples/59_ampere_gather_scatter_conv/CMakeLists.txt +++ b/examples/59_ampere_gather_scatter_conv/CMakeLists.txt @@ -26,6 +26,8 @@ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +if (NOT MSVC) + cutlass_example_add_executable( 59_ampere_gather_scatter_conv ampere_gather_scatter_conv.cu @@ -34,3 +36,5 @@ cutlass_example_add_executable( if (CUTLASS_ENABLE_OPENMP_TESTS AND OpenMP_CXX_FOUND) target_link_libraries(59_ampere_gather_scatter_conv PRIVATE OpenMP::OpenMP_CXX) endif() + +endif() diff --git a/examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu b/examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu new file mode 100644 index 0000000000..8bb14b4556 --- /dev/null +++ b/examples/61_hopper_gemm_with_topk_and_softmax/61_hopper_gemm_with_topk_and_softmax.cu @@ -0,0 +1,534 @@ +/*************************************************************************************************** + * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief Hopper GEMM + Top-K + Softmax fusion + + This example illustrates how to use the LinCombTopKSoftmaxCol EVT node to fuse + Top-K and Softmax into the GEMM epilogue, with certain assumptions made. + + Those assumptions are as: + 1. Fusion is over the N dimension. + 2. 
Top-K is either 2 or 4 elements, and the value is static (meaning two kernels have to be + compiled to support both.) + 3. The GEMM tile shape along N is greater than or equal to problem size + along N. + + + The example runs the fused GEMM kernel, along with a standard unfused host reference, and + manually performs Top-K and softmax, and compares the error between tensors. + + Note that some numerical error (smaller than 1e-5) is to be expected, but this is true + in most efficient reduction kernels, because floating point addition is not necessarily + associative. +*/ + +#include + +#include "cutlass/cutlass.h" +#include "cutlass/numeric_types.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" +#include "cutlass/epilogue/dispatch_policy.hpp" +#include "cutlass/epilogue/collective/collective_builder.hpp" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/host/error_metrics.h" +#include "cutlass/util/reference/host/tensor_fill.h" +#include "cutlass/util/reference/host/tensor_copy.h" +#include "cutlass/util/reference/host/tensor_compare.h" +#include "cutlass/util/reference/host/tensor_norm.h" +#include "cutlass/util/reference/host/gett.hpp" + + +#include "helper.h" + +using namespace cute; + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + +static constexpr int TopK = 2; +static constexpr bool EnableTopKSoftmax = TopK > 1; + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// + +// A matrix configuration +using ElementA = cutlass::half_t; // Element type for A matrix operand +using LayoutA = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +// B matrix configuration +using ElementB = cutlass::half_t; // Element type for B matrix operand +using LayoutB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes) + +// C matrix configuration +using ElementC = void; +using LayoutC = cutlass::layout::RowMajor; +constexpr int AlignmentC = 1; + +// D matrix configuration +using ElementD = cutlass::half_t; // Element type for C and D matrix operands +using LayoutD = cutlass::layout::RowMajor; // Layout type for output +constexpr int AlignmentD = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of output in units of elements (up to 16 bytes) + +// Core kernel configurations +using ElementAccumulator = float; // Element type for internal accumulation +using ElementCompute = float; // Element type for epilogue computation +using ArchTag = cutlass::arch::Sm90; // Tag indicating the minimum SM that supports the intended feature +using OperatorClass = cutlass::arch::OpClassTensorOp; // Operator class tag +using TileShape = Shape<_64,_64,_128>; // Threadblock-level tile size 
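+// Note: per assumption (3) above, the problem's N extent may not exceed the tile's N
+// (64 here): the fused Top-K + softmax must see an entire row within one tile along N.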
+using ClusterShape = Shape<_1,_1,_1>; // Shape of the threadblocks in a cluster +using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecialized; +using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecialized; + +// Top-K + Softmax fusion operation +using FusionOperation = std::conditional_t, + typename cutlass::epilogue::fusion::LinearCombination +>; + +// The fusion op only allows for epilogue tiles matching the mainloop tile. +using EpilogueTileType = decltype(cute::take<0,2>(TileShape{})); + +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + ArchTag, OperatorClass, + TileShape, ClusterShape, + EpilogueTileType, + ElementAccumulator, ElementCompute, + ElementC, LayoutC, AlignmentC, + ElementD, LayoutD, AlignmentD, + EpilogueSchedule, + FusionOperation + >::CollectiveOp; + +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + ArchTag, OperatorClass, + ElementA, LayoutA, AlignmentA, + ElementB, LayoutB, AlignmentB, + ElementAccumulator, + TileShape, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout< + static_cast(sizeof(typename CollectiveEpilogue::SharedStorage)) + >, + KernelSchedule + >::CollectiveOp; + +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + Shape, // Indicates ProblemShape + CollectiveMainloop, + CollectiveEpilogue +>; + +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + +// Extract information from Gemm kernel. +using EpilogueOutputOp = typename Gemm::EpilogueOutputOp; +using ElementScalar = typename EpilogueOutputOp::ElementScalar; + +using StrideA = typename Gemm::GemmKernel::StrideA; +using StrideB = typename Gemm::GemmKernel::StrideB; +using StrideD = typename Gemm::GemmKernel::StrideD; + +/// Initialization +StrideA stride_A; +StrideB stride_B; +StrideD stride_D; +uint64_t seed; + +cutlass::HostTensor tensor_A; +cutlass::HostTensor tensor_B; +cutlass::HostTensor tensor_D; +cutlass::HostTensor tensor_ref_D; + +using LayoutScalar = cutlass::layout::PackedVectorLayout; + +#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Testbed utility types +///////////////////////////////////////////////////////////////////////////////////////////////// + +// Command line options parsing +struct Options { + + bool help = false; + + int iterations = 1000; + int m = 16, n = 8, k = 64, l = 1; + double eps = 1e-5; + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("l", l); + cmd.get_cmd_line_argument("iterations", iterations); + cmd.get_cmd_line_argument("eps", eps); + } + + /// Prints the usage statement. + std::ostream & print_usage(std::ostream &out) const { + + out << "61_hopper_gemm_with_topk_and_softmax\n\n" + << " Hopper FP8 GEMM with Top-K and softmax fusion.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --l= Sets the l extent (batch) of the GEMM\n" + << " --iterations= Number of profiling iterations to perform.\n\n" + << " --eps= Threshold of numerical verification. 
Default: 1e-5.\n\n"; + + out + << "\n\nExamples:\n\n" + << "$ " << "61_hopper_gemm_with_topk_and_softmax" << " --m=16 --n=8 --k=1024 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const + { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } + + float alpha() const { + return 1.f / static_cast(k); + } +}; + +/// Result structure +struct Result { + double avg_runtime_ms; + double gflops; + cutlass::Status status; + cudaError_t error; + bool passed; + + Result( + double avg_runtime_ms = 0, + double gflops = 0, + cutlass::Status status = cutlass::Status::kSuccess, + cudaError_t error = cudaSuccess) + : + avg_runtime_ms(avg_runtime_ms), gflops(gflops), status(status), error(error), passed(false) + {} + +}; + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM setup and evaluation +///////////////////////////////////////////////////////////////////////////////////////////////// + +/// Helper to initialize a block of device data +template +bool initialize_tensor( + cutlass::TensorView view, + uint64_t seed) { + cutlass::reference::host::TensorFillRandomUniform( + view, seed, /* max = */ 1, /* min = */ -1, /* bits = */ 2); + return true; +} + +/// Initialize operands to be used in the GEMM and reference GEMM +void initialize(const Options &options) { + + stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l)); + stride_B = cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(options.n, options.k, options.l)); + stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(options.m, options.n, options.l)); + + auto a_coord = cutlass::make_Coord(options.m * options.l, options.k); + auto c_coord = cutlass::make_Coord(options.m * options.l, options.n); + auto b_coord = cutlass::make_Coord(options.k, options.n * options.l); + + tensor_A.resize(a_coord); + tensor_B.resize(b_coord); + tensor_D.resize(c_coord); + tensor_ref_D.resize(c_coord); + + initialize_tensor(tensor_A.host_view(), seed + 2022); + initialize_tensor(tensor_B.host_view(), seed + 2023); + + tensor_A.sync_device(); + tensor_B.sync_device(); + tensor_D.sync_device(); +} + +/// Populates a Gemm::Arguments structure from the given commandline options +typename Gemm::Arguments args_from_options(const Options &options) { + typename Gemm::Arguments arguments{ + cutlass::gemm::GemmUniversalMode::kGemm, + {options.m, options.n, options.k, options.l}, + {tensor_A.device_data(), stride_A, tensor_B.device_data(), stride_B}, + { + {options.alpha(), 0.f}, // alpha, beta + nullptr, stride_D, + tensor_D.device_data(), stride_D + } + }; + + return arguments; +} + +bool verify(const Options &options) { + // + // Compute reference output + // + + // Create instantiation for device reference gemm kernel + auto A = cute::make_tensor(tensor_A.host_data(), + cute::make_layout(cute::make_shape(options.m, options.k, options.l), stride_A)); + auto B = cute::make_tensor(tensor_B.host_data(), + cute::make_layout(cute::make_shape(options.n, options.k, options.l), stride_B)); + auto D = cute::make_tensor(tensor_ref_D.host_data(), + cute::make_layout(cute::make_shape(options.m, options.n, options.l), stride_D)); + using unused_t = decltype(D); + + cutlass::reference::host::GettMainloopParams mainloop_params{A, B}; + + 
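+  // The host reference below first computes dense D = alpha * (A x B) via Gemm3x, then
+  // applies Top-K + softmax row by row. Worked instance for TopK = 2, row logits {1, 3, 2}:
+  // keep {3, 2}; softmax over the kept pair gives {e^0, e^-1} / (e^0 + e^-1) ~= {0.731, 0.269};
+  // the row becomes {0, 0.731, 0.269} with all non-Top-K entries zeroed.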
cutlass::reference::host::GettEpilogueParams< + ElementScalar, + ElementScalar, + ElementAccumulator, + ElementCompute, + unused_t, + decltype(D), + unused_t, // bias + unused_t, // aux + unused_t, // valpha + unused_t // vbeta + > epilogue_params; + + epilogue_params.D = D; + epilogue_params.alpha = options.alpha(); + epilogue_params.beta = 0.f; + + // get reference result + cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params); + + if constexpr (EnableTopKSoftmax) { + // top-K + softmax + for (int i = 0; i < options.m; ++i) { + + // Find Top-K + cutlass::Array top_k; + top_k.fill(-cutlass::platform::numeric_limits::infinity()); + for (int j = 0; j < options.n; ++j) { + auto val = static_cast(tensor_ref_D.host_view().ref().at({i, j})); + for (int top_k_idx = 0; top_k_idx < TopK; ++top_k_idx) { + if (val > top_k[top_k_idx]) { + // Shift down + for (int l = TopK - 1; l > top_k_idx; --l) { + top_k[l] = top_k[l - 1]; + } + top_k[top_k_idx] = val; + break; + } + } + } + + // This formulation of top-K + softmax only works when it is + // guaranteed that none of the top-K elements are repeated! + // If this is the case, the device kernel can also make mistakes, because + // A. Once the top-K values are reduced, and the operation is being applied, + // there is no way to tell repeated elements apart, so none are masked. + // B. The softmax sum of exps will be incorrect (because the repeated elements + // are not repeated in it.) + + ElementAccumulator max = top_k[0]; + ElementAccumulator sum = ElementAccumulator(0.f); + for (int top_k_idx = 0; top_k_idx < TopK; ++top_k_idx) { + sum = sum + cutlass::fast_exp(top_k[top_k_idx] - max); + } + + for (int j=0; j < options.n; ++j) { + auto val = tensor_ref_D.host_view().ref().at({i, j}); + if (val < top_k[TopK - 1]) { + tensor_ref_D.host_view().ref().at({i, j}) = static_cast(0.f); + } else { + // Softmax + auto softmax_val = cutlass::fast_exp(val - max) / sum; + tensor_ref_D.host_view().ref().at({i, j}) = static_cast(softmax_val); + } + } + } + } + + // compare_reference + tensor_D.sync_host(); + + double err = cutlass::reference::host::TensorRelativeErrorMetric( + tensor_D.host_view(), + tensor_ref_D.host_view()); + bool passed = err < options.eps; + + if (options.m <= 32 && options.n <= 32) { + std::cout << "GEMM output:\n" << tensor_D.host_view() << "\n\n"; + std::cout << "Reference output:\n" << tensor_ref_D.host_view() << "\n\n"; + } + + std::cout << " Disposition: " << (passed ? 
"Passed" : "Failed") << " \t Relative error: " << err << std::endl; + + return passed; +} + +/// Execute a given example GEMM computation +template +int run(Options &options) { + initialize(options); + + // Instantiate CUTLASS kernel depending on templates + Gemm gemm; + + // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm + auto arguments = args_from_options(options); + + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + + // Allocate workspace memory + cutlass::device_memory::allocation workspace(workspace_size); + + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + + // Initialize CUTLASS kernel with arguments and workspace pointer + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + + // Correctness / Warmup iteration + CUTLASS_CHECK(gemm.run()); + + // Check if output from CUTLASS kernel and reference kernel are equal or not + Result result; + result.passed = verify(options); + + if (!result.passed) { + exit(-1); + } + + // Run profiling loop + if (options.iterations > 0) { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + CUTLASS_CHECK(gemm.run()); + } + timer.stop(); + + // Compute average runtime and GFLOPs. + float elapsed_ms = timer.elapsed_millis(); + result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations); + result.gflops = options.gflops(result.avg_runtime_ms / 1000.0); + + std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.l << std::endl; + std::cout << " Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl; + std::cout << " GFLOPS: " << result.gflops << std::endl; + } + + return 0; +} + +#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +int main(int argc, char const **args) { + + // CUTLASS must be compiled with CUDA 12.0 Toolkit to run this example + // and must have compute capability at least 90. + if (__CUDACC_VER_MAJOR__ < 12) { + std::cerr << "This example requires CUDA 12 or newer.\n"; + // Returning zero so this test passes on older Toolkits. Its actions are no-op. + return 0; + } + + cudaDeviceProp props; + int current_device_id; + CUDA_CHECK(cudaGetDevice(¤t_device_id)); + CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); + cudaError_t error = cudaGetDeviceProperties(&props, 0); + if (props.major < 9) { + std::cerr + << "This example requires a GPU of NVIDIA's Hopper Architecture or " + << "later (compute capability 90 or greater).\n"; + return 0; + } + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + + // + // Evaluate CUTLASS kernels + // + +#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED) + run(options); +#endif + + return 0; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/61_hopper_gemm_with_topk_and_softmax/CMakeLists.txt b/examples/61_hopper_gemm_with_topk_and_softmax/CMakeLists.txt new file mode 100644 index 0000000000..7d9160a733 --- /dev/null +++ b/examples/61_hopper_gemm_with_topk_and_softmax/CMakeLists.txt @@ -0,0 +1,32 @@ +# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+# SPDX-License-Identifier: BSD-3-Clause +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# 1. Redistributions of source code must retain the above copyright notice, this +# list of conditions and the following disclaimer. +# +# 2. Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# +# 3. Neither the name of the copyright holder nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +cutlass_example_add_executable( + 61_hopper_gemm_with_topk_and_softmax + 61_hopper_gemm_with_topk_and_softmax.cu + ) diff --git a/examples/62_hopper_sparse_gemm/62_hopper_sparse_gemm.cu b/examples/62_hopper_sparse_gemm/62_hopper_sparse_gemm.cu new file mode 100644 index 0000000000..c3f1ce709a --- /dev/null +++ b/examples/62_hopper_sparse_gemm/62_hopper_sparse_gemm.cu @@ -0,0 +1,596 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief Hopper Sparse GEMM example. + + This example demonstrates how to construct and run a structured sparse GEMM kernel + on NVIDIA Hopper architecture. + +*/ + +#include + +#include "cutlass/cutlass.h" + +#include "cute/tensor.hpp" +#include "cutlass/tensor_ref.h" +#include "cutlass/epilogue/collective/default_epilogue.hpp" +#include "cutlass/epilogue/thread/linear_combination.h" +#include "cutlass/epilogue/collective/collective_builder.hpp" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/collective/collective_builder.hpp" +#include "cutlass/gemm/device/gemm_universal_adapter.h" +#include "cutlass/gemm/kernel/gemm_universal.hpp" +#include "cutlass/transform/device/transform_universal_adapter.hpp" +#include "cutlass/transform/kernel/sparse_gemm_compressor.hpp" + +#include "cutlass/util/command_line.h" +#include "cutlass/util/distribution.h" +#include "cutlass/util/host_tensor.h" +#include "cutlass/util/packed_stride.hpp" +#include "cutlass/util/tensor_view_io.h" +#include "cutlass/util/reference/device/gemm.h" +#include "cutlass/util/reference/device/tensor_compare.h" +#include "cutlass/util/reference/device/tensor_fill.h" + +#include "helper.h" + +using namespace cute; + +#if defined(CUTLASS_ARCH_MMA_SPARSE_SM90_SUPPORTED) + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// GEMM kernel configurations +///////////////////////////////////////////////////////////////////////////////////////////////// + +// A matrix configuration +using ElementA = cutlass::half_t; // Element type for A matrix operand +using LayoutTagA = cutlass::layout::RowMajor; // Layout type for A matrix operand +constexpr int AlignmentA = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes) + +// B matrix configuration +using ElementB = cutlass::half_t; // Element type for B matrix operand +using LayoutTagB = cutlass::layout::ColumnMajor; // Layout type for B matrix operand +constexpr int AlignmentB = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes) + +// C/D matrix configuration +using ElementC = float; // Element type for C and D matrix operands +using LayoutTagC = cutlass::layout::ColumnMajor; // Layout type for C and D matrix operands +constexpr int AlignmentC = 128 / cutlass::sizeof_bits::value; // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes) + +// Core kernel configurations +using ElementAccumulator = float; // Element type for internal accumulation +using TileShape = Shape<_128,_128,_128>; // Threadblock-level tile size for sparse kernel +using TileShapeRef = Shape<_128,_128, _64>; // Threadblock-level tile size for reference (dense) kernel +using ClusterShape = Shape<_1,_2,_1>; // Shape of the threadblocks 
in a cluster +using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecialized; // Kernel schedule policy +using EpilogueSchedule = cutlass::epilogue::TmaWarpSpecialized; // Epilogue schedule policy + +using ProblemShape = Shape; + +// Sparse kernel setup + +using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder< + cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, + TileShape, ClusterShape, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementAccumulator, + ElementC, LayoutTagC, AlignmentC, + ElementC, LayoutTagC, AlignmentC, + EpilogueSchedule + >::CollectiveOp; + +using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm90, cutlass::arch::OpClassSparseTensorOp, + ElementA, LayoutTagA, AlignmentA, + ElementB, LayoutTagB, AlignmentB, + ElementAccumulator, + TileShape, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout< + static_cast(sizeof(typename CollectiveEpilogue::SharedStorage))>, + KernelSchedule + >::CollectiveOp; + +using GemmKernel = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloop, + CollectiveEpilogue +>; + +using Gemm = cutlass::gemm::device::GemmUniversalAdapter; + +// Reference (dense) kernel setup + +using CollectiveEpilogueRef = typename cutlass::epilogue::collective::CollectiveBuilder< + cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, + TileShapeRef, ClusterShape, + cutlass::epilogue::collective::EpilogueTileAuto, + ElementAccumulator, ElementAccumulator, + ElementC, LayoutTagC, AlignmentC, + ElementC, LayoutTagC, AlignmentC, + EpilogueSchedule + >::CollectiveOp; + +using CollectiveMainloopRef = typename cutlass::gemm::collective::CollectiveBuilder< + cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp, + ElementA, LayoutTagA, AlignmentA, + ElementB, LayoutTagB, AlignmentB, + ElementAccumulator, + TileShapeRef, ClusterShape, + cutlass::gemm::collective::StageCountAutoCarveout< + static_cast(sizeof(typename CollectiveEpilogue::SharedStorage))>, + KernelSchedule + >::CollectiveOp; + +using GemmKernelRef = cutlass::gemm::kernel::GemmUniversal< + ProblemShape, + CollectiveMainloopRef, + CollectiveEpilogue +>; + +using GemmRef = cutlass::gemm::device::GemmUniversalAdapter; + +// Layouts +using LayoutA = typename Gemm::GemmKernel::CollectiveMainloop::LayoutA; +using LayoutE = typename Gemm::GemmKernel::CollectiveMainloop::LayoutE; +using StrideB = typename Gemm::GemmKernel::StrideB; +using StrideC = typename Gemm::GemmKernel::StrideC; +using StrideD = typename Gemm::GemmKernel::StrideD; + +// Layouts for reference (non-sparse) tensors +using StrideA = cutlass::gemm::TagToStrideA_t; +using StrideE = StrideA; + +using ElementE = typename Gemm::GemmKernel::CollectiveMainloop::ElementE; +using SparseConfig = typename Gemm::GemmKernel::CollectiveMainloop::SparseConfig; + +// Offline compressor kernel +using CompressorUtility = cutlass::transform::kernel::StructuredSparseCompressorUtility< + ProblemShape, + ElementA, + LayoutTagA, + SparseConfig>; + +using CompressorKernel = cutlass::transform::kernel::StructuredSparseCompressor< + ProblemShape, + ElementA, + LayoutTagA, + SparseConfig, + cutlass::arch::Sm90>; + +using Compressor = cutlass::transform::device::TransformUniversalAdapter; + +// +// Data members +// + +ProblemShape problem_shape; + +StrideA stride_A; +StrideA stride_A_compressed; +StrideE stride_E; +StrideB stride_B; +StrideC stride_C; +StrideD stride_D; + +LayoutA layout_A; +LayoutE layout_E; + +uint64_t seed; + 
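+// Device-side buffers: block_A is the dense operand; block_A_compressed and block_E
+// receive the compressor's outputs (kept values and 2:4 metadata); block_D and
+// block_D_ref hold the sparse kernel's result and the dense reference result.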
+cutlass::DeviceAllocation<ElementA> block_A;
+cutlass::DeviceAllocation<ElementA> block_A_compressed;
+cutlass::DeviceAllocation<ElementE> block_E;
+cutlass::DeviceAllocation<ElementB> block_B;
+cutlass::DeviceAllocation<ElementC> block_C;
+cutlass::DeviceAllocation<ElementC> block_D;
+cutlass::DeviceAllocation<ElementC> block_D_ref;
+
+#endif // defined(CUTLASS_ARCH_MMA_SPARSE_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// Testbed utility types
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+// Command line options parsing
+struct Options {
+
+  bool help;
+
+  float alpha, beta;
+  int iterations;
+  int m, n, k, l;
+
+  Options():
+    help(false),
+    m(5120), n(4096), k(16384), l(1),
+    alpha(1.f), beta(0.f),
+    iterations(10)
+  { }
+
+  // Parses the command line
+  void parse(int argc, char const **args) {
+    cutlass::CommandLine cmd(argc, args);
+
+    if (cmd.check_cmd_line_flag("help")) {
+      help = true;
+      return;
+    }
+
+    cmd.get_cmd_line_argument("m", m);
+    cmd.get_cmd_line_argument("n", n);
+    cmd.get_cmd_line_argument("k", k);
+    cmd.get_cmd_line_argument("l", l);
+    cmd.get_cmd_line_argument("alpha", alpha);
+    cmd.get_cmd_line_argument("beta", beta);
+    cmd.get_cmd_line_argument("iterations", iterations);
+  }
+
+  /// Prints the usage statement.
+  std::ostream & print_usage(std::ostream &out) const {
+
+    out << "62_hopper_sparse_gemm\n\n"
+        << "  Hopper Sparse GEMM example.\n\n"
+        << "Options:\n\n"
+        << "  --help                      If specified, displays this usage statement\n\n"
+        << "  --m=<int>                   Sets the M extent of the GEMM\n"
+        << "  --n=<int>                   Sets the N extent of the GEMM\n"
+        << "  --k=<int>                   Sets the K extent of the GEMM\n"
+        << "  --l=<int>                   Sets the L extent of the GEMM (batch size)\n"
+        << "  --alpha=<f32>               Epilogue scalar alpha\n"
+        << "  --beta=<f32>                Epilogue scalar beta\n\n"
+        << "  --iterations=<int>          Number of profiling iterations to perform.\n\n";
+
+    out
+      << "\n\nExamples:\n\n"
+      << "$ " << "62_hopper_sparse_gemm" << " --m=4096 --n=5120 --k=8192 --l=1 --alpha=2 --beta=0.707 \n\n";
+
+    return out;
+  }
+
+  /// Compute performance in GFLOP/s
+  double gflops(double runtime_s) const
+  {
+    // Two flops per multiply-add
+    uint64_t flop = uint64_t(2) * m * n * k;
+    double gflop = double(flop) / double(1.0e9);
+    return gflop / runtime_s;
+  }
+};
+
+#if defined(CUTLASS_ARCH_MMA_SPARSE_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM setup and evaluation
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+/// Helper to initialize a block of device data
+template <class Element>
+bool initialize_block(
+  cutlass::DeviceAllocation<Element>& block,
+  uint64_t seed) {
+
+  Element scope_max, scope_min;
+  int bits_input = cutlass::sizeof_bits<Element>::value;
+
+  if (bits_input == 1) {
+    scope_max = Element(2);
+    scope_min = Element(0);
+  } else if (bits_input <= 8) {
+    scope_max = Element(2);
+    scope_min = Element(-2);
+  } else {
+    scope_max = Element(8);
+    scope_min = Element(-8);
+  }
+
+  cutlass::reference::device::BlockFillRandomUniform(
+    block.get(), block.size(), seed, scope_max, scope_min, 0);
+
+  return true;
+}
+
+/// Make A structured sparse by replacing elements with 0 and compress it
+bool sparsify_and_compress()
+{
+  auto [M, N, K, L] = problem_shape;
+  CompressorUtility compressor_utility(problem_shape, stride_A);
+
+  int ME = compressor_utility.get_metadata_m_physical();
+  int KE = compressor_utility.get_metadata_k_physical();
+  int KC = compressor_utility.get_tensorA_k_physical();
+
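+  // Compression geometry: 2:4 sparsity keeps 2 of every 4 values along K, so the
+  // compressed A is (M x KC x L) with KC ~= K/2; E stores the indices of the kept
+  // values (about 4 bits per four-element group), giving the (ME x KE) extents above.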
block_A_compressed.reset(M * KC * L); + block_E.reset(ME * KE * L); + + stride_A_compressed = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(M, KC, L)); + stride_E = cutlass::make_cute_packed_stride(StrideE{}, cute::make_shape(ME, KE, L)); + + // Random sparsification is performed on host + std::vector block_A_host(block_A.size()); + cutlass::device_memory::copy_to_host(block_A_host.data(), block_A.get(), block_A.size()); + compressor_utility.structure_sparse_zero_mask_fill(block_A_host.data(), static_cast(seed + 2024)); + cutlass::device_memory::copy_to_device(block_A.get(), block_A_host.data(), block_A.size()); + + cutlass::KernelHardwareInfo hw_info; + hw_info.device_id = 0; + hw_info.sm_count = cutlass::KernelHardwareInfo::query_device_multiprocessor_count(hw_info.device_id); + typename Compressor::Arguments arguments { + problem_shape, + { block_A.get(), + stride_A, + block_A_compressed.get(), + block_E.get() }, + {hw_info} }; + + Compressor compressor_op; + size_t workspace_size = Compressor::get_workspace_size(arguments); + cutlass::device_memory::allocation workspace(workspace_size); + + CUTLASS_CHECK(compressor_op.can_implement(arguments)); + CUTLASS_CHECK(compressor_op.initialize(arguments, workspace.get())); + CUTLASS_CHECK(compressor_op.run()); + CUDA_CHECK(cudaDeviceSynchronize()); + + return true; +} + +/// Initialize operands to be used in the GEMM and reference GEMM +bool initialize(Options const& options) { + + problem_shape = make_tuple(options.m, options.n, options.k, options.l); + auto [M, N, K, L] = problem_shape; + + stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(M, K, L)); + stride_B = cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(N, K, L)); + stride_C = cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(M, N, L)); + stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(M, N, L)); + + // Allocate memory for tensors + block_A.reset(M * K * L); + block_B.reset(N * K * L); + block_C.reset(M * N * L); + block_D.reset(M * N * L); + block_D_ref.reset(M * N * L); + + // Fill input tensors with data + initialize_block(block_A, seed + 2021); + initialize_block(block_B, seed + 2022); + initialize_block(block_C, seed + 2023); + + // Replace 0 in A with 1 to avoid metadata changes + std::vector block_A_host(block_A.size()); + cutlass::device_memory::copy_to_host(block_A_host.data(), block_A.get(), block_A.size()); + for (size_t i = 0; i < block_A.size(); ++i) if (block_A_host[i] == ElementA(0)) block_A_host[i] = ElementA(1.0); + cutlass::device_memory::copy_to_device(block_A.get(), block_A_host.data(), block_A.size()); + + if (!sparsify_and_compress()) { + return false; + }; + + // Build the compressed/metadata layouts + layout_A = SparseConfig::fill_layoutA(problem_shape); + layout_E = SparseConfig::fill_layoutE(problem_shape); + + return true; +} + +/// Populates a Gemm::Arguments structure from the given commandline options +typename Gemm::Arguments make_args(Options const& options) +{ + typename Gemm::Arguments arguments{ + cutlass::gemm::GemmUniversalMode::kGemm, + problem_shape, + { block_A_compressed.get(), layout_A, block_B.get(), stride_B, block_E.get(), layout_E }, + { { ElementAccumulator(options.alpha), ElementAccumulator(options.beta) }, + block_C.get(), stride_C, block_D.get(), stride_D } + }; + + return arguments; +} + +typename GemmRef::Arguments make_args_ref(Options const& options) +{ + typename GemmRef::Arguments arguments{ + cutlass::gemm::GemmUniversalMode::kGemm, + 
problem_shape, + { block_A.get(), stride_A, block_B.get(), stride_B }, + { { ElementAccumulator(options.alpha), ElementAccumulator(options.beta) }, + block_C.get(), stride_C, block_D_ref.get(), stride_D } + }; + + return arguments; +} + +template +void print_device_tensor(cute::Tensor const& t) +{ + // Assumes size = cosize, i.e. compact tensor + std::vector data_host(t.size()); + cutlass::device_memory::copy_to_host(data_host.data(), t.data(), t.size()); + auto t_host = cute::make_tensor(data_host.data(), t.layout()); + cute::print_tensor(t_host); +} + +bool verify(Options const& options) { + CUDA_CHECK(cudaDeviceSynchronize()); + + bool passed = cutlass::reference::device::BlockCompareEqual(block_D_ref.get(), block_D.get(), block_D.size()); + +#if 0 + if (!passed) { + auto [M, N, K, L] = problem_shape; + CompressorUtility compressor_utility(problem_shape, stride_A); + int ME = compressor_utility.get_metadata_m_physical(); + int KE = compressor_utility.get_metadata_k_physical(); + int KC = compressor_utility.get_tensorA_k_physical(); + + cute::print("A (original): "); print_device_tensor(make_tensor(block_A.get(), make_shape(M, K, L), stride_A)); + cute::print("A (compressed): "); print_device_tensor(make_tensor(block_A_compressed.get(), make_shape(M, KC, L), stride_A_compressed)); + cute::print("E (physical): "); print_device_tensor(make_tensor(block_E.get(), make_shape(ME, KE, L), stride_E)); + cute::print("E (logical): "); print_device_tensor(make_tensor(block_E.get(), upcast(layout_E))); + cute::print("B: "); print_device_tensor(make_tensor(block_B.get(), make_shape(N, K, L), stride_B)); + cute::print("C: "); print_device_tensor(make_tensor(block_C.get(), make_shape(M, N, L), stride_C)); + cute::print("D reference: "); print_device_tensor(make_tensor(block_D_ref.get(), make_shape(M, N, L), stride_D)); + cute::print("D computed: "); print_device_tensor(make_tensor(block_D.get(), make_shape(M, N, L), stride_D)); + } +#endif + + return passed; +} + +template +struct Runner +{ + using Arguments = typename Gemm::Arguments; + + Runner(Arguments args): arguments(args) { + // Using the arguments, query for extra workspace required for matrix multiplication computation + size_t workspace_size = Gemm::get_workspace_size(arguments); + + // Allocate workspace memory + workspace.reset(workspace_size); + + // Check if the problem size is supported or not + CUTLASS_CHECK(gemm.can_implement(arguments)); + } + + void run() { + CUTLASS_CHECK(gemm.initialize(arguments, workspace.get())); + CUTLASS_CHECK(gemm.run()); + } + + void benchmark(Options const& options) { + if (options.iterations > 0) + { + GpuTimer timer; + timer.start(); + for (int iter = 0; iter < options.iterations; ++iter) { + run(); + } + timer.stop(); + + // Compute average runtime and GFLOPs. 
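+      // Note: gflops() charges the same 2*M*N*K dense flop count to both kernels, so the
+      // sparse kernel's advantage appears directly as a higher reported GFLOPS figure.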
+ float elapsed_ms = timer.elapsed_millis(); + double avg_runtime_ms = double(elapsed_ms) / double(options.iterations); + double gflops = options.gflops(avg_runtime_ms / 1000.0); + + std::cout << " Avg runtime: " << avg_runtime_ms << " ms" << std::endl; + std::cout << " GFLOPS: " << gflops << std::endl; + } + } + + Gemm gemm; + Arguments arguments; + cutlass::device_memory::allocation workspace; +}; + +/// Execute the example (verification and timing) +void run(Options &options) { + bool init = initialize(options); + if (!init) { + std::cout << "Initialization failure" << std::endl; + exit(EXIT_FAILURE); + } + + Runner gemm(make_args(options)); + Runner gemm_ref(make_args_ref(options)); + + gemm.run(); + gemm_ref.run(); + + bool passed = verify(options); + + std::cout << " Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << std::endl; + std::cout << " Disposition: " << (passed ? "Passed" : "Failed") << std::endl; + + if (!passed) { + exit(EXIT_FAILURE); + } + + std::cout << "Sparse GEMM:" << std::endl; + gemm.benchmark(options); + + std::cout << "Dense GEMM:" << std::endl; + gemm_ref.benchmark(options); +} + +#endif // defined(CUTLASS_ARCH_MMA_SPARSE_SM90_SUPPORTED) + +/////////////////////////////////////////////////////////////////////////////////////////////////// + +int main(int argc, char const **args) { + + // CUTLASS must be compiled with CUDA 12.2 Toolkit to run this example + // and must have compute capability at least 90. + if (__CUDACC_VER_MAJOR__ < 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ < 2)) { + std::cerr << "This example requires CUDA 12.2 or newer.\n"; + // Returning zero so this test passes on older Toolkits. Its actions are no-op. + return 0; + } + + cudaDeviceProp props; + int current_device_id; + CUDA_CHECK(cudaGetDevice(¤t_device_id)); + CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id)); + cudaError_t error = cudaGetDeviceProperties(&props, 0); + if (props.major < 9) { + std::cerr + << "This example requires a GPU of NVIDIA's Hopper Architecture or " + << "later (compute capability 90 or greater).\n"; + return 0; + } + // + // Parse options + // + + Options options; + + options.parse(argc, args); + + if (options.help) { + options.print_usage(std::cout) << std::endl; + return 0; + } + + // + // Evaluate CUTLASS kernels + // + +#if defined(CUTLASS_ARCH_MMA_SPARSE_SM90_SUPPORTED) + run(options); +#endif + + return EXIT_SUCCESS; +} + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/examples/62_hopper_sparse_gemm/CMakeLists.txt b/examples/62_hopper_sparse_gemm/CMakeLists.txt new file mode 100644 index 0000000000..cf55da4552 --- /dev/null +++ b/examples/62_hopper_sparse_gemm/CMakeLists.txt @@ -0,0 +1,36 @@ + +# Copyright (c) 2024 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: BSD-3-Clause +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are met: +# +# 1. Redistributions of source code must retain the above copyright notice, this +# list of conditions and the following disclaimer. +# +# 2. Redistributions in binary form must reproduce the above copyright notice, +# this list of conditions and the following disclaimer in the documentation +# and/or other materials provided with the distribution. +# +# 3. 
Neither the name of the copyright holder nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +# Sparse kernel in this example triggers an ICE in gcc 7.5 +if (NOT (CMAKE_CXX_COMPILER_ID STREQUAL "GNU" AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS 8.0)) +cutlass_example_add_executable( + 62_hopper_sparse_gemm + 62_hopper_sparse_gemm.cu + ) +endif() diff --git a/examples/63_hopper_gemm_with_weight_prefetch/63_hopper_gemm_with_weight_prefetch.cu b/examples/63_hopper_gemm_with_weight_prefetch/63_hopper_gemm_with_weight_prefetch.cu new file mode 100644 index 0000000000..03c54a8ee9 --- /dev/null +++ b/examples/63_hopper_gemm_with_weight_prefetch/63_hopper_gemm_with_weight_prefetch.cu @@ -0,0 +1,500 @@ +/*************************************************************************************************** + * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! 
\file
+    \brief Hopper FP8 GEMM + L2 Weight Prefetch
+
+    This example implements a non-persistent warp-specialized GEMM kernel for the Hopper
+    architecture with programmatic dependent launch (PDL), which enables prefetching weights
+    into the L2 cache.
+
+    For more information about dependent launch, refer to the CUDA programming guide:
+    https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization
+
+    In some cases, PDL opens a window in which the previous kernel is no longer actively
+    using DRAM while the next kernel sits idle waiting for it to finish. During this window,
+    the next kernel can begin loading a non-dependent operand (e.g. the weights of a linear
+    projection, which are typically static) and cache it in L2.
+
+    The kernel and collective mainloop assume operand `A` corresponds to weights and operand
+    `B` corresponds to activations (so the batch/token count can be very small).
+    After initialization, the prefetch warp starts loading K tiles of `A` into an unused
+    portion of shared memory, and loads up to half of all K tiles that the same CTA would
+    eventually load. The exact number of K tiles loaded is determined by
+    `args.mainloop.prefetch_ratio` \in [0.0, 1.0]: smaller values result in less prefetching,
+    and larger values result in more. Negative values result in a "best-effort" prefetch,
+    meaning the prefetcher stops issuing weight loads as soon as the activation DMA warp
+    starts loading (that is, as soon as it is signaled that the previous kernel has flushed
+    its memory).
+
+    The DMA warp responsible for loading `A` also begins loading K tiles until it fills up
+    the available shared memory.
+    The DMA warp responsible for loading `B` waits until activations are flushed to global
+    memory by the preceding kernel.
+
+    Another mainloop parameter, `args.mainloop.overlap_ratio` \in [0.0, 1.0], determines how
+    early the next kernel (the one doing the prefetch) is launched. Smaller values result in
+    greater overlap, and larger values result in less. Negative values disable PDL completely,
+    meaning there will be no overlap, which renders the prefetch ineffective.
+
+    These two runtime parameters should be tuned per problem size and GEMM config combination,
+    and if feasible, per operation in an entire layer or model.
+
+    NOTE: you must build this target with the following flag to enable Grid Dependency Control
+    instructions (GDC) in CUTLASS:
+      - CUTLASS_ENABLE_GDC_FOR_SM90
+
+    To lock persistence mode, power (350 W), and clocks (1005 MHz) for evaluation (assumes
+    device 0 and an H100):
+
+    $ sudo nvidia-smi -pm 1 -i 0
+
+    $ sudo nvidia-smi -i 0 -pl 350
+
+    $ sudo nvidia-smi -i 0 -lgc 1005
+
+    Example:
+
+    $ mkdir build && cd build
+
+    $ cmake .. -DCUTLASS_NVCC_ARCHS="90a" -DCUTLASS_ENABLE_GDC_FOR_SM90=1
+
+    $ cd examples/63_hopper_gemm_with_weight_prefetch
+
+    $ make
+
+    $ ./63_hopper_gemm_with_weight_prefetch --p=0.5 --o=0.5
+*/
+
+#include <iostream>
+
+#include "cutlass/cutlass.h"
+#include "cutlass/numeric_types.h"
+
+#include "cute/tensor.hpp"
+#include "cutlass/tensor_ref.h"
+#include "cutlass/gemm/dispatch_policy.hpp"
+#include "cutlass/gemm/collective/collective_builder.hpp"
+#include "cutlass/gemm/device/gemm_universal_adapter.h"
+#include "cutlass/gemm/kernel/gemm_universal.hpp"
+#include "cutlass/epilogue/dispatch_policy.hpp"
+#include "cutlass/epilogue/collective/collective_builder.hpp"
+
+#include "cutlass/util/command_line.h"
+#include "cutlass/util/distribution.h"
+#include "cutlass/util/host_tensor.h"
+#include "cutlass/util/packed_stride.hpp"
+#include "cutlass/util/tensor_view_io.h"
+#include "cutlass/util/reference/host/tensor_fill.h"
+#include "cutlass/util/reference/host/tensor_copy.h"
+#include "cutlass/util/reference/host/tensor_compare.h"
+#include "cutlass/util/reference/host/tensor_norm.h"
+#include "cutlass/util/reference/host/gett.hpp"
+
+
+#include "collective/dispatch_policy_extra.hpp"
+#include "collective/builder.hpp"
+#include "kernel/sm90_gemm_tma_warpspecialized_with_prefetch.hpp"
+
+#include "helper.h"
+#include "gemm_with_weight_prefetch_commandline.hpp"
+
+using namespace cute;
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM kernel configurations
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+// A matrix configuration
+using ElementA = cutlass::float_e4m3_t;                             // Element type for A matrix operand
+using LayoutA  = cutlass::layout::RowMajor;                         // Layout type for A matrix operand
+constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value;  // Memory access granularity/alignment of A matrix in units of elements (up to 16 bytes)
+
+// B matrix configuration
+using ElementB = cutlass::float_e5m2_t;                             // Element type for B matrix operand
+using LayoutB  = cutlass::layout::ColumnMajor;                      // Layout type for B matrix operand
+constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value;  // Memory access granularity/alignment of B matrix in units of elements (up to 16 bytes)
+
+// C matrix configuration
+using ElementC = cutlass::float_e4m3_t;                             // Element type for C and D matrix operands
+using LayoutC  = cutlass::layout::ColumnMajor;                      // Layout type for C and D matrix operands
+constexpr int AlignmentC = 128 / cutlass::sizeof_bits<ElementC>::value;  // Memory access granularity/alignment of C matrix in units of elements (up to 16 bytes)
+
+// D matrix configuration
+using ElementD = ElementC;
+using LayoutD  = LayoutC;
+constexpr int AlignmentD = AlignmentC;
+
+// Core kernel configurations
+using ElementAccumulator = float;                          // Element type for internal accumulation
+using ElementCompute     = float;                          // Element type for epilogue computation
+using ArchTag            = cutlass::arch::Sm90;            // Tag indicating the minimum SM that supports the intended feature
+using OperatorClass      = cutlass::arch::OpClassTensorOp; // Operator class tag
+using TileShape          = Shape<_64,_64,_128>;            // Threadblock-level tile size
+// Cluster_N > 1 is not supported yet.
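+// (The prefetch-enabled collective mainloop statically asserts that the cluster shape along N
+// is 1; see the static_assert in sm90_mma_tma_gmma_ss_warpspecialized_with_prefetch.hpp.)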
+using ClusterShape      = Shape<_1,_1,_1>;                 // Shape of the threadblocks in a cluster
+using KernelSchedule    = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccumWithPrefetchAndSplitDMA;
+using EpilogueSchedule  = cutlass::epilogue::TmaWarpSpecialized;
+using EpilogueTileType  = cutlass::epilogue::collective::EpilogueTileAuto;
+
+using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    TileShape, ClusterShape,
+    EpilogueTileType,
+    ElementAccumulator, ElementCompute,
+    ElementC, LayoutC, AlignmentC,
+    ElementD, LayoutD, AlignmentD,
+    EpilogueSchedule
+  >::CollectiveOp;
+
+using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
+    ArchTag, OperatorClass,
+    ElementA, LayoutA, AlignmentA,
+    ElementB, LayoutB, AlignmentB,
+    ElementAccumulator,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))
+    >,
+    KernelSchedule
+  >::CollectiveOp;
+
+using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
+    Shape<int,int,int,int>, // Indicates ProblemShape
+    CollectiveMainloop,
+    CollectiveEpilogue
+>;
+
+using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
+
+// Extract information from Gemm kernel.
+using EpilogueOutputOp = typename Gemm::EpilogueOutputOp;
+using ElementScalar    = typename EpilogueOutputOp::ElementScalar;
+
+using StrideA = typename Gemm::GemmKernel::StrideA;
+using StrideB = typename Gemm::GemmKernel::StrideB;
+using StrideC = typename Gemm::GemmKernel::StrideC;
+using StrideD = typename Gemm::GemmKernel::StrideD;
+
+/// Initialization
+StrideA stride_A;
+StrideB stride_B;
+StrideC stride_C;
+StrideD stride_D;
+uint64_t seed;
+
+cutlass::HostTensor<ElementA, LayoutA> tensor_A;
+cutlass::HostTensor<ElementB, LayoutB> tensor_B;
+cutlass::HostTensor<ElementC, LayoutC> tensor_C;
+cutlass::HostTensor<ElementD, LayoutD> tensor_D;
+cutlass::HostTensor<ElementD, LayoutD> tensor_ref_D;
+
+using LayoutScalar = cutlass::layout::PackedVectorLayout;
+cutlass::HostTensor<ElementScalar, LayoutScalar> scalar_alpha;
+cutlass::HostTensor<ElementScalar, LayoutScalar> scalar_beta;
+
+#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// Testbed utility types
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+/// Result structure
+struct Result
+{
+  double avg_runtime_ms;
+  double gflops;
+  double eff_bw;
+  cutlass::Status status;
+  cudaError_t error;
+  bool passed;
+
+  Result(
+    double avg_runtime_ms = 0,
+    double gflops = 0,
+    double eff_bw = 0,
+    cutlass::Status status = cutlass::Status::kSuccess,
+    cudaError_t error = cudaSuccess)
+  :
+    avg_runtime_ms(avg_runtime_ms), gflops(gflops), eff_bw(eff_bw), status(status), error(error), passed(false)
+  {}
+
+};
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+/// GEMM setup and evaluation
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+/// Helper to initialize a block of device data
+template <typename Element, typename Layout>
+bool initialize_tensor(
+  cutlass::TensorView<Element, Layout> view,
+  uint64_t seed) {
+
+  double scope_max, scope_min;
+  int bits_input  = cutlass::sizeof_bits<Element>::value;
+  int bits_output = cutlass::sizeof_bits<ElementD>::value;
+
+  if (bits_input == 1) {
+    scope_max = 2;
+    scope_min = 0;
+  }
+  else if (bits_input <= 8) {
+    scope_max = 2;
+    scope_min = -2;
+  }
+  else if (bits_output == 16) {
+    scope_max = 5;
+    scope_min = -5;
+  }
+  else {
+    scope_max = 8;
+    scope_min = -8;
+  }
+
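+  // Passing 0 for the `bits` argument below truncates the uniform values to integers,
+  // which keeps the host reference accumulation exact even for narrow operand types.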
+  cutlass::reference::host::TensorFillRandomUniform(
+    view, seed, scope_max, scope_min, 0);
+
+  return true;
+}
+
+/// Initialize operands to be used in the GEMM and reference GEMM
+void initialize(const Options &options) {
+
+  stride_A = cutlass::make_cute_packed_stride(StrideA{}, cute::make_shape(options.m, options.k, options.l));
+  stride_B = cutlass::make_cute_packed_stride(StrideB{}, cute::make_shape(options.n, options.k, options.l));
+  stride_C = cutlass::make_cute_packed_stride(StrideC{}, cute::make_shape(options.m, options.n, options.l));
+  stride_D = cutlass::make_cute_packed_stride(StrideD{}, cute::make_shape(options.m, options.n, options.l));
+
+  auto a_coord = cutlass::make_Coord(options.m * options.l, options.k);
+  auto c_coord = cutlass::make_Coord(options.m * options.l, options.n);
+  auto b_coord = cutlass::make_Coord(options.k, options.n * options.l);
+
+  tensor_A.resize(a_coord);
+  tensor_B.resize(b_coord);
+  tensor_C.resize(c_coord);
+  tensor_D.resize(c_coord);
+  tensor_ref_D.resize(c_coord);
+
+  initialize_tensor(tensor_A.host_view(), seed + 2022);
+  initialize_tensor(tensor_B.host_view(), seed + 2023);
+  initialize_tensor(tensor_C.host_view(), seed + 2024);
+
+  tensor_A.sync_device();
+  tensor_B.sync_device();
+  tensor_C.sync_device();
+  tensor_D.sync_device();
+}
+
+/// Populates a Gemm::Arguments structure from the given commandline options
+typename Gemm::Arguments args_from_options(const Options &options)
+{
+  typename Gemm::Arguments arguments{
+    cutlass::gemm::GemmUniversalMode::kGemm,
+    {options.m, options.n, options.k, options.l},
+    {tensor_A.device_data(), stride_A, tensor_B.device_data(), stride_B},
+    {
+      {}, // epilogue.thread
+      tensor_C.device_data(), stride_C,
+      tensor_D.device_data(), stride_D
+    }
+  };
+
+  auto &fusion_args = arguments.epilogue.thread;
+  fusion_args.alpha = options.alpha;
+  fusion_args.beta = options.beta;
+  fusion_args.alpha_ptr = scalar_alpha.device_data();
+  fusion_args.beta_ptr = scalar_beta.device_data();
+
+  arguments.mainloop.overlap_ratio = options.overlap_ratio;
+  arguments.mainloop.prefetch_ratio = options.prefetch_ratio;
+
+  return arguments;
+}
+
+bool verify(const Options &options) {
+  //
+  // Compute reference output
+  //
+
+  // Create instantiation for device reference gemm kernel
+  auto A = cute::make_tensor(tensor_A.host_data(),
+      cute::make_layout(cute::make_shape(options.m, options.k, options.l), stride_A));
+  auto B = cute::make_tensor(tensor_B.host_data(),
+      cute::make_layout(cute::make_shape(options.n, options.k, options.l), stride_B));
+  auto C = cute::make_tensor(tensor_C.host_data(),
+      cute::make_layout(cute::make_shape(options.m, options.n, options.l), stride_C));
+  auto D = cute::make_tensor(tensor_ref_D.host_data(),
+      cute::make_layout(cute::make_shape(options.m, options.n, options.l), stride_D));
+  using unused_t = decltype(D);
+
+  cutlass::reference::host::GettMainloopParams<ElementAccumulator, decltype(A), decltype(B)> mainloop_params{A, B};
+
+  cutlass::reference::host::GettEpilogueParams<
+    ElementScalar,
+    ElementScalar,
+    ElementAccumulator,
+    ElementCompute,
+    decltype(C),
+    decltype(D),
+    unused_t, // bias
+    unused_t, // aux
+    unused_t, // valpha
+    unused_t  // vbeta
+  > epilogue_params;
+
+  epilogue_params.C = C;
+  epilogue_params.D = D;
+  epilogue_params.alpha = options.alpha;
+  epilogue_params.beta = options.beta;
+
+  // get reference result
+  cutlass::reference::host::Gemm3x(mainloop_params, epilogue_params);
+
+  // compare_reference
+  tensor_D.sync_host();
+  bool passed = cutlass::reference::host::TensorEquals(tensor_ref_D.host_view(), tensor_D.host_view());
+
+  return passed;
+}
+
+/// Execute a given example GEMM computation
+template <typename Gemm>
+int run(Options &options)
+{
+  initialize(options);
+
+  // Instantiate CUTLASS kernel depending on templates
+  Gemm gemm;
+
+  // Create a structure of gemm kernel arguments suitable for invoking an instance of Gemm
+  auto arguments = args_from_options(options);
+
+  // Using the arguments, query for extra workspace required for matrix multiplication computation
+  size_t workspace_size = Gemm::get_workspace_size(arguments);
+
+  // Allocate workspace memory
+  cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
+
+  // Check if the problem size is supported or not
+  CUTLASS_CHECK(gemm.can_implement(arguments));
+
+  // Initialize CUTLASS kernel with arguments and workspace pointer
+  CUTLASS_CHECK(gemm.initialize(arguments, workspace.get()));
+
+  // Correctness / Warmup iteration
+  CUTLASS_CHECK(gemm.run(nullptr, nullptr, /* launch_with_pdl = */ options.overlap_ratio >= 0));
+
+  // Check if output from CUTLASS kernel and reference kernel are equal or not
+  Result result;
+  result.passed = verify(options);
+
+  std::cout << "  Disposition: " << (result.passed ? "Passed" : "Failed") << std::endl;
+
+  if (!result.passed) {
+    exit(-1);
+  }
+
+  // Run profiling loop
+  if (options.iterations > 0)
+  {
+    GpuTimer timer;
+    timer.start();
+    for (int iter = 0; iter < options.iterations; ++iter) {
+      CUTLASS_CHECK(gemm.run(nullptr, nullptr, /* launch_with_pdl = */ options.overlap_ratio >= 0));
+    }
+    timer.stop();
+
+    // Compute average runtime and GFLOPs.
+    float elapsed_ms = timer.elapsed_millis();
+    result.avg_runtime_ms = double(elapsed_ms) / double(options.iterations);
+    double avg_runtime_s = (double)(result.avg_runtime_ms / 1000.0);
+    result.gflops = options.gflops(avg_runtime_s);
+    result.eff_bw = options.effective_bandwidth(avg_runtime_s, sizeof(ElementA), sizeof(ElementB), sizeof(ElementC), sizeof(ElementD));
+
+    std::cout << "  Problem Size: " << options.m << 'x' << options.n << 'x' << options.k << 'x' << options.l << std::endl;
+    std::cout << "  Avg runtime: " << result.avg_runtime_ms << " ms" << std::endl;
+    std::cout << "  GFLOPS: " << result.gflops << std::endl;
+    std::cout << "  Effective bandwidth: " << result.eff_bw << " GB/s" << std::endl;
+  }
+
+  return 0;
+}
+
+#endif // defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+
+///////////////////////////////////////////////////////////////////////////////////////////////////
+
+int main(int argc, char const **args) {
+
+  // CUTLASS must be compiled with the CUDA 12.0 Toolkit to run this example
+  // and must have compute capability at least 90.
+  if (__CUDACC_VER_MAJOR__ < 12) {
+    std::cerr << "This example requires CUDA 12 or newer.\n";
+    // Returning zero so this test passes on older Toolkits. Its actions are a no-op.
+    return 0;
+  }
+
+  cudaDeviceProp props;
+  int current_device_id;
+  CUDA_CHECK(cudaGetDevice(&current_device_id));
+  CUDA_CHECK(cudaGetDeviceProperties(&props, current_device_id));
+  if (props.major < 9) {
+    std::cerr
+      << "This example requires a GPU of NVIDIA's Hopper Architecture or "
+      << "later (compute capability 90 or greater).\n";
+    return 0;
+  }
+
+  //
+  // Parse options
+  //
+
+  Options options;
+
+  options.parse(argc, args);
+
+  if (options.help) {
+    options.print_usage(std::cout) << std::endl;
+    return 0;
+  }
+
+  //
+  // Evaluate CUTLASS kernels
+  //
+
+#if defined(CUTLASS_ARCH_MMA_SM90_SUPPORTED)
+  run<Gemm>(options);
+#endif
+
+  return 0;
+}
diff --git a/examples/63_hopper_gemm_with_weight_prefetch/CMakeLists.txt b/examples/63_hopper_gemm_with_weight_prefetch/CMakeLists.txt
new file mode 100644
index 0000000000..f48673241a
--- /dev/null
+++ b/examples/63_hopper_gemm_with_weight_prefetch/CMakeLists.txt
@@ -0,0 +1,36 @@
+# Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: BSD-3-Clause
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are met:
+#
+# 1. Redistributions of source code must retain the above copyright notice, this
+# list of conditions and the following disclaimer.
+#
+# 2. Redistributions in binary form must reproduce the above copyright notice,
+# this list of conditions and the following disclaimer in the documentation
+# and/or other materials provided with the distribution.
+#
+# 3. Neither the name of the copyright holder nor the names of its
+# contributors may be used to endorse or promote products derived from
+# this software without specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+include_directories(
+  .
+)
+
+cutlass_example_add_executable(
+  63_hopper_gemm_with_weight_prefetch
+  63_hopper_gemm_with_weight_prefetch.cu
+  )
diff --git a/examples/63_hopper_gemm_with_weight_prefetch/README.md b/examples/63_hopper_gemm_with_weight_prefetch/README.md
new file mode 100644
index 0000000000..5dac1cc6c2
--- /dev/null
+++ b/examples/63_hopper_gemm_with_weight_prefetch/README.md
@@ -0,0 +1,82 @@
+# GEMM with L2 weight prefetch
+
+A non-persistent warp-specialized GEMM aimed at low-latency inference.
+
+The kernel can optionally prefetch a portion of the weights (operand `A`) into L2 cache while the
+rest of the warps are waiting on the previous kernel to finish writing and flush its memory.
+A typical case is a normalization or reduction kernel that is immediately followed by a GEMM.
+
+It exposes two runtime parameters:
+1. `overlap_ratio`: how early `griddepcontrol.launch_dependent_grids` is issued.
+   Default is `0.5`, meaning the signal is issued after approximately half of the K tiles have
+   been loaded by the DMA warps.
+2. `prefetch_ratio`: what percentage of K tiles to prefetch.
+   Default is `-1.0`, meaning prefetching will stop as soon as the other DMA warps are past
+   `griddepcontrol`.
+
+It is highly recommended to auto-tune these parameters per GEMM, and against some end-to-end
+runtime (an entire transformer layer or several of them, but probably not the entire model).
+
+TMA loads use non-default cache hints: `A` (weights) is loaded with `EvictFirst`, and `B`
+(activations) is loaded with `EvictLast`.
+
+## Getting started
+To use this kernel in your own target, add this directory to your includes, and include the
+following headers from this example:
+
+```cxx
+#include "collective/dispatch_policy_extra.hpp"
+#include "collective/builder.hpp"
+#include "kernel/sm90_gemm_tma_warpspecialized_with_prefetch.hpp"
+```
+
+Then use one of the new kernel schedules:
+
+```cxx
+// Without separate warps for A and B
+using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccumWithPrefetch;
+
+// With separate warps for A and B
+using KernelSchedule = cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccumWithPrefetchAndSplitDMA;
+```
+
+The kernel with separate warps for A and B
+(`KernelTmaWarpSpecializedFP8FastAccumWithPrefetchAndSplitDMA`)
+is expected to be the more performant of the two, especially since it allows the kernel to load
+weights into shared memory ahead of the `griddepcontrol`.
+
+As for other GEMM parameters: thread block clusters larger than 1 CTA are not yet supported,
+and since the kernel layer implementation is warp-specialized and TMA-based, other kernel
+layers or collectives would require reimplementation. A complete kernel definition is sketched
+in the "Putting it together" section at the end of this README.
+
+## Example
+
+Using the example is straightforward: build it, then run with your choice of `MNK`:
+
+```bash
+./63_hopper_gemm_with_weight_prefetch --m=8192 --n=1 --k=8192
+```
+
+You can also disable the overlap, or try different overlap and prefetch ratios, and observe
+the difference:
+
+```bash
+echo "Without overlap and prefetch"
+./63_hopper_gemm_with_weight_prefetch --o=-1.0 --p=-1.0
+
+echo "Overlap ratio of 0.5, best effort prefetch"
+./63_hopper_gemm_with_weight_prefetch --o=0.5 --p=-1.0
+
+echo "Overlap ratio of 0.8, prefetch ratio of 0.7"
+./63_hopper_gemm_with_weight_prefetch --o=0.8 --p=0.7
+```
+
+However, note that the example still runs a single GEMM; most of the performance improvement
+is expected in end-to-end applications.
+
+
+## Limitations
+* The parameter defaults are typically not good choices, especially `prefetch_ratio`.
+  When `prefetch_ratio` is unspecified (set to `-1.0`), the prefetch warp will `try_wait` on a
+  memory barrier before issuing every single TMA load, and in many cases this will slow down
+  prefetching to the point of being almost ineffective.
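+
+## Putting it together
+
+The snippet below is a minimal sketch of a complete kernel definition using the split-DMA
+schedule. It restates the type aliases from this example's `63_hopper_gemm_with_weight_prefetch.cu`
+(FP8 operands, FP32 accumulation, 16-element alignment for 8-bit types); treat it as a starting
+point rather than a tuned configuration:
+
+```cxx
+using namespace cute;
+
+using ElementA     = cutlass::float_e4m3_t;   // weights
+using ElementB     = cutlass::float_e5m2_t;   // activations
+using ElementC     = cutlass::float_e4m3_t;
+using TileShape    = Shape<_64,_64,_128>;
+using ClusterShape = Shape<_1,_1,_1>;         // Cluster_N > 1 is not supported yet
+
+using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
+    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
+    TileShape, ClusterShape,
+    cutlass::epilogue::collective::EpilogueTileAuto,
+    float, float,
+    ElementC, cutlass::layout::ColumnMajor, 16,
+    ElementC, cutlass::layout::ColumnMajor, 16,
+    cutlass::epilogue::TmaWarpSpecialized
+  >::CollectiveOp;
+
+using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
+    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
+    ElementA, cutlass::layout::RowMajor, 16,
+    ElementB, cutlass::layout::ColumnMajor, 16,
+    float,
+    TileShape, ClusterShape,
+    cutlass::gemm::collective::StageCountAutoCarveout<
+      static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
+    cutlass::gemm::KernelTmaWarpSpecializedFP8FastAccumWithPrefetchAndSplitDMA
+  >::CollectiveOp;
+
+using Gemm = cutlass::gemm::device::GemmUniversalAdapter<
+    cutlass::gemm::kernel::GemmUniversal<
+      Shape<int,int,int,int>, CollectiveMainloop, CollectiveEpilogue>>;
+```
+
+The prefetch knobs live on the mainloop arguments, and PDL is requested at launch time. In the
+sketch below, `args` and `workspace` stand in for a fully populated `typename Gemm::Arguments`
+and an allocated workspace buffer, as constructed in the example's `run()`:
+
+```cxx
+args.mainloop.overlap_ratio  = 0.5f;   // launch the dependent grid after ~half the K tiles
+args.mainloop.prefetch_ratio = 0.7f;   // prefetch ~70% of this CTA's K tiles
+
+Gemm gemm;
+CUTLASS_CHECK(gemm.can_implement(args));
+CUTLASS_CHECK(gemm.initialize(args, workspace.get()));
+CUTLASS_CHECK(gemm.run(/*stream=*/nullptr, /*cuda_adapter=*/nullptr, /*launch_with_pdl=*/true));
+```
diff --git a/examples/63_hopper_gemm_with_weight_prefetch/collective/builder.hpp b/examples/63_hopper_gemm_with_weight_prefetch/collective/builder.hpp
new file mode 100644
index 0000000000..57365a8b36
--- /dev/null
+++ b/examples/63_hopper_gemm_with_weight_prefetch/collective/builder.hpp
@@ -0,0 +1,215 @@
+/***************************************************************************************************
+ * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.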
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+#pragma once
+
+#include "cutlass/gemm/collective/collective_builder.hpp"
+
+#include "dispatch_policy_extra.hpp"
+#include "sm90_mma_tma_gmma_ss_warpspecialized_with_prefetch.hpp"
+
+namespace cutlass::gemm::collective {
+
+// GMMA_TMA_WS_FP8_FAST_ACCUM_SS + prefetch
+template <
+  class ElementA,
+  class GmemLayoutATag,
+  int AlignmentA,
+  class ElementB,
+  class GmemLayoutBTag,
+  int AlignmentB,
+  class ElementAccumulator,
+  class TileShape_MNK,
+  class ClusterShape_MNK,
+  class StageCountType,
+  class KernelScheduleType
+>
+struct CollectiveBuilder<
+    arch::Sm90,
+    arch::OpClassTensorOp,
+    ElementA,
+    GmemLayoutATag,
+    AlignmentA,
+    ElementB,
+    GmemLayoutBTag,
+    AlignmentB,
+    ElementAccumulator,
+    TileShape_MNK,
+    ClusterShape_MNK,
+    StageCountType,
+    KernelScheduleType,
+    cute::enable_if_t<
+      cute::is_same_v<KernelScheduleType, KernelTmaWarpSpecializedFP8FastAccumWithPrefetch>>
+> {
+  static_assert(is_static<TileShape_MNK>::value);
+  static_assert(is_static<ClusterShape_MNK>::value);
+  static_assert(detail::is_aligned<ElementA, AlignmentA, ElementB, AlignmentB, detail::tma_alignment_bytes>(),
+                "Not meet TMA alignment requirement yet\n");
+  static_assert(detail::is_input_fp8<ElementA, ElementB>(),
+                "Only FP8 datatypes are compatible with these kernel schedules\n");
+  // Dispatch TN fp8 kernels only to TMA warp specialized FP8 builder
+  static_assert(!detail::is_use_rmem_A<ElementA, GmemLayoutATag, ElementB, GmemLayoutBTag>(),
+                "Not supported for fp8 non-TN warp specialized kernels yet\n");
+#ifndef CUTLASS_SM90_COLLECTIVE_BUILDER_SUPPORTED
+  static_assert(cutlass::detail::dependent_false<ElementA>, "Unsupported Toolkit for SM90 Collective Builder\n");
+#endif
+
+  static constexpr cute::GMMA::Major GmmaMajorA = detail::gmma_ss_tag_to_major_A<ElementA, GmemLayoutATag>();
+  static constexpr cute::GMMA::Major GmmaMajorB = detail::gmma_ss_tag_to_major_B<ElementB, GmemLayoutBTag>();
+
+  using AtomLayoutMNK = Layout<Shape<_1,_1,_1>>;
+
+  using TiledMma = decltype(cute::make_tiled_mma(cute::GMMA::ss_op_selector<
+      ElementA, ElementB, ElementAccumulator, TileShape_MNK, GmmaMajorA, GmmaMajorB>(), AtomLayoutMNK{}));
+
+  using GmemTiledCopyA = decltype(detail::sm90_cluster_shape_to_tma_atom(shape<1>(ClusterShape_MNK{})));
+  using GmemTiledCopyB = decltype(detail::sm90_cluster_shape_to_tma_atom(shape<0>(ClusterShape_MNK{})));
+
+  using SmemLayoutAtomA = decltype(detail::ss_smem_selector<
+      GmmaMajorA, ElementA, decltype(cute::get<0>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>());
+  using SmemLayoutAtomB = decltype(detail::ss_smem_selector<
+      GmmaMajorB, ElementB, decltype(cute::get<1>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>());
+
+  static constexpr int PipelineStages = detail::compute_stage_count_or_override<
+      detail::sm90_smem_capacity_bytes, ElementA, ElementB, TileShape_MNK>(StageCountType{});
+  using DispatchPolicy = MainloopSm90TmaGmmaWarpSpecializedWithPrefetch<
+      PipelineStages, ClusterShape_MNK, KernelScheduleType>;
+
+  using SmemCopyAtomA = void;
+  using SmemCopyAtomB = void;
+
+  using CollectiveOp = CollectiveMma<
+      DispatchPolicy,
+      TileShape_MNK,
+      ElementA,
+      TagToStrideA_t<GmemLayoutATag>,
+      ElementB,
+      TagToStrideB_t<GmemLayoutBTag>,
+      TiledMma,
+      GmemTiledCopyA,
+      SmemLayoutAtomA,
+      SmemCopyAtomA,
+      cute::identity,
+      GmemTiledCopyB,
+      SmemLayoutAtomB,
+      SmemCopyAtomB,
+      cute::identity
+  >;
+};
+
+// GMMA_TMA_WS_FP8_FAST_ACCUM_SS + prefetch and split DMA warps
+template <
+  class ElementA,
+  class GmemLayoutATag,
+  int AlignmentA,
+  class ElementB,
+  class GmemLayoutBTag,
+  int AlignmentB,
+  class ElementAccumulator,
+  class TileShape_MNK,
+  class ClusterShape_MNK,
+  class StageCountType,
+  class KernelScheduleType
+>
+struct CollectiveBuilder<
+    arch::Sm90,
+    arch::OpClassTensorOp,
+    ElementA,
+    GmemLayoutATag,
+    AlignmentA,
+    ElementB,
+    GmemLayoutBTag,
+    AlignmentB,
+    ElementAccumulator,
+    TileShape_MNK,
+    ClusterShape_MNK,
+    StageCountType,
+    KernelScheduleType,
+    cute::enable_if_t<
+      cute::is_same_v<KernelScheduleType, KernelTmaWarpSpecializedFP8FastAccumWithPrefetchAndSplitDMA>>
+> {
+  static_assert(is_static<TileShape_MNK>::value);
+  static_assert(is_static<ClusterShape_MNK>::value);
+  static_assert(detail::is_aligned<ElementA, AlignmentA, ElementB, AlignmentB, detail::tma_alignment_bytes>(),
+                "Not meet TMA alignment requirement yet\n");
+  static_assert(detail::is_input_fp8<ElementA, ElementB>(),
+                "Only FP8 datatypes are compatible with these kernel schedules\n");
+  // Dispatch TN fp8 kernels only to TMA warp specialized FP8 builder
+  static_assert(!detail::is_use_rmem_A<ElementA, GmemLayoutATag, ElementB, GmemLayoutBTag>(),
+                "Not supported for fp8 non-TN warp specialized kernels yet\n");
+#ifndef CUTLASS_SM90_COLLECTIVE_BUILDER_SUPPORTED
+  static_assert(cutlass::detail::dependent_false<ElementA>, "Unsupported Toolkit for SM90 Collective Builder\n");
+#endif
+
+  static constexpr cute::GMMA::Major GmmaMajorA = detail::gmma_ss_tag_to_major_A<ElementA, GmemLayoutATag>();
+  static constexpr cute::GMMA::Major GmmaMajorB = detail::gmma_ss_tag_to_major_B<ElementB, GmemLayoutBTag>();
+
+  using AtomLayoutMNK = Layout<Shape<_1,_1,_1>>;
+
+  using TiledMma = decltype(cute::make_tiled_mma(cute::GMMA::ss_op_selector<
+      ElementA, ElementB, ElementAccumulator, TileShape_MNK, GmmaMajorA, GmmaMajorB>(), AtomLayoutMNK{}));
+
+  using GmemTiledCopyA = decltype(detail::sm90_cluster_shape_to_tma_atom(shape<1>(ClusterShape_MNK{})));
+  using GmemTiledCopyB = decltype(detail::sm90_cluster_shape_to_tma_atom(shape<0>(ClusterShape_MNK{})));
+
+  using SmemLayoutAtomA = decltype(detail::ss_smem_selector<
+      GmmaMajorA, ElementA, decltype(cute::get<0>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>());
+  using SmemLayoutAtomB = decltype(detail::ss_smem_selector<
+      GmmaMajorB, ElementB, decltype(cute::get<1>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>());
+
+  static constexpr int PipelineStages = detail::compute_stage_count_or_override<
+      detail::sm90_smem_capacity_bytes, ElementA, ElementB, TileShape_MNK>(StageCountType{});
+  using DispatchPolicy = MainloopSm90TmaGmmaWarpSpecializedWithPrefetch<
+      PipelineStages, ClusterShape_MNK, KernelScheduleType>;
+
+  using SmemCopyAtomA = void;
+  using SmemCopyAtomB = void;
+
+  using CollectiveOp = CollectiveMma<
+      DispatchPolicy,
+      TileShape_MNK,
+      ElementA,
+      TagToStrideA_t<GmemLayoutATag>,
+      ElementB,
+      TagToStrideB_t<GmemLayoutBTag>,
+      TiledMma,
+      GmemTiledCopyA,
+      SmemLayoutAtomA,
+      SmemCopyAtomA,
+      cute::identity,
+      GmemTiledCopyB,
+      SmemLayoutAtomB,
+      SmemCopyAtomB,
+      cute::identity
+  >;
+};
+
+} // namespace cutlass::gemm::collective
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/63_hopper_gemm_with_weight_prefetch/collective/dispatch_policy_extra.hpp b/examples/63_hopper_gemm_with_weight_prefetch/collective/dispatch_policy_extra.hpp
new file mode 100644
index 0000000000..37369176f9
--- /dev/null
+++ b/examples/63_hopper_gemm_with_weight_prefetch/collective/dispatch_policy_extra.hpp
@@ -0,0 +1,61 @@
+/***************************************************************************************************
+ * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+#pragma once
+
+namespace cutlass::gemm {
+
+// Standard non-persistent kernel with a single producer warp, and one prefetch warp.
+// `A` is assumed to be static, and therefore the prefetch warp attempts to load `A`
+// while the producer warp is waiting on griddepcontrol.
+// GDC `launch_dependent_grids` is issued from the producer warp instead of the math warps,
+// according to the overlap ratio.
+struct KernelTmaWarpSpecializedFP8FastAccumWithPrefetch { };
+
+// Non-persistent kernel with two producer warps (one for each of A and B), and one prefetch warp.
+// `A` is assumed to be static, and therefore the producer warp for `A` attempts to load `A`
+// while the producer warp for `B` is waiting on griddepcontrol. The producer warp for `A` does
+// not wait on griddepcontrol and starts loading immediately.
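+// As with the schedule above, `launch_dependent_grids` is issued from a producer warp (the one
+// loading `A`), once the fraction of K tiles given by the overlap ratio has been issued.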
+struct KernelTmaWarpSpecializedFP8FastAccumWithPrefetchAndSplitDMA { }; + +template< + int Stages_, + class ClusterShape_ = Shape<_1,_1,_1>, + class KernelSchedule = KernelTmaWarpSpecializedFP8FastAccumWithPrefetch +> +struct MainloopSm90TmaGmmaWarpSpecializedWithPrefetch { + constexpr static int Stages = Stages_; + using ClusterShape = ClusterShape_; + using ArchTag = arch::Sm90; + using Schedule = KernelSchedule; +}; + +} // namespace cutlass::gemm diff --git a/examples/63_hopper_gemm_with_weight_prefetch/collective/sm90_mma_tma_gmma_ss_warpspecialized_with_prefetch.hpp b/examples/63_hopper_gemm_with_weight_prefetch/collective/sm90_mma_tma_gmma_ss_warpspecialized_with_prefetch.hpp new file mode 100644 index 0000000000..710224d78c --- /dev/null +++ b/examples/63_hopper_gemm_with_weight_prefetch/collective/sm90_mma_tma_gmma_ss_warpspecialized_with_prefetch.hpp @@ -0,0 +1,867 @@ +/*************************************************************************************************** + * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + **************************************************************************************************/ + +#pragma once + +#include "cutlass/cutlass.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/numeric_types.h" +#include "cutlass/pipeline/pipeline.hpp" +#include "cutlass/trace.h" + +#include "cute/arch/cluster_sm90.hpp" +#include "cute/arch/copy_sm90.hpp" +#include "cute/algorithm/functional.hpp" +#include "cute/atom/mma_atom.hpp" +#include "cute/algorithm/gemm.hpp" +#include "cute/tensor_predicate.hpp" +#include "cute/numeric/arithmetic_tuple.hpp" +#include "cutlass/arch/grid_dependency_control.h" + +#include "dispatch_policy_extra.hpp" + +#include "../pipeline/prefetch_pipeline_sm90.hpp" + +///////////////////////////////////////////////////////////////////////////////////////////////// + +namespace cutlass::gemm::collective { +using namespace cute; + +///////////////////////////////////////////////////////////////////////////////////////////////// + +// WarpSpecialized Mainloop +template < + int Stages, + class ClusterShape, + class KernelSchedule, + class TileShape_, + class ElementA_, + class StrideA_, + class ElementB_, + class StrideB_, + class TiledMma_, + class GmemTiledCopyA_, + class SmemLayoutAtomA_, + class SmemCopyAtomA_, + class TransformA_, + class GmemTiledCopyB_, + class SmemLayoutAtomB_, + class SmemCopyAtomB_, + class TransformB_> +struct CollectiveMma< + MainloopSm90TmaGmmaWarpSpecializedWithPrefetch, + TileShape_, + ElementA_, + StrideA_, + ElementB_, + StrideB_, + TiledMma_, + GmemTiledCopyA_, + SmemLayoutAtomA_, + SmemCopyAtomA_, + TransformA_, + GmemTiledCopyB_, + SmemLayoutAtomB_, + SmemCopyAtomB_, + TransformB_> +{ + // + // Type Aliases + // + using DispatchPolicy = MainloopSm90TmaGmmaWarpSpecializedWithPrefetch; + using TileShape = TileShape_; + using ElementA = ElementA_; + using StrideA = StrideA_; + using ElementB = ElementB_; + using StrideB = StrideB_; + using TiledMma = TiledMma_; + using ElementAccumulator = typename TiledMma::ValTypeC; + using GmemTiledCopyA = GmemTiledCopyA_; + using GmemTiledCopyB = GmemTiledCopyB_; + using SmemLayoutAtomA = SmemLayoutAtomA_; + using SmemLayoutAtomB = SmemLayoutAtomB_; + using SmemCopyAtomA = SmemCopyAtomA_; + using SmemCopyAtomB = SmemCopyAtomB_; + using TransformA = TransformA_; + using TransformB = TransformB_; + using ArchTag = typename DispatchPolicy::ArchTag; + + static_assert(size<1>(ClusterShape{}) == 1, "Cluster shape N must be 1"); + using CtaShape_MNK = decltype(shape_div(TileShape{}, ClusterShape{})); + + static constexpr int PrefetchStages = 4; + static constexpr int PrefetchInitialStages = 1; + // This determines how much shmem we set aside for prefetch. + // We don't reuse anything loaded by prefetcher, so we can keep + // loading into the same place -- there will be a conflict when + // writing, but it doesn't affect performance as much as the doors + // that this opens. 
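+  // In other words, prefetched tiles are never consumed from shared memory; only their
+  // L2-filling side effect matters, so repeatedly overwriting the same stage is harmless.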
+ static constexpr int PrefetchStagesActual = 1; + using PrefetcherPipeline = cutlass::PrefetchPipeline; + + using MainloopPipeline = cutlass::PipelineTmaAsync; + using PipelineState = cutlass::PipelineState; + using PipelineParams = typename MainloopPipeline::Params; + + static_assert(cute::rank(SmemLayoutAtomA{}) == 2, "SmemLayoutAtom must be rank 2 (M/N, K)"); + static_assert((size<0>(TileShape{}) % size<0>(SmemLayoutAtomA{})) == 0, "SmemLayoutAtom must evenly divide tile shape."); + static_assert((size<2>(TileShape{}) % size<1>(SmemLayoutAtomA{})) == 0, "SmemLayoutAtom must evenly divide tile shape."); + + static_assert(cute::rank(SmemLayoutAtomB{}) == 2, "SmemLayoutAtom must be rank 2 (M/N, K)"); + static_assert((size<1>(TileShape{}) % size<0>(SmemLayoutAtomB{})) == 0, "SmemLayoutAtom must evenly divide tile shape."); + static_assert((size<2>(TileShape{}) % size<1>(SmemLayoutAtomB{})) == 0, "SmemLayoutAtom must evenly divide tile shape."); + + // Tile along modes in a way that maximizes the TMA box size. + using SmemLayoutA = decltype(tile_to_shape( + SmemLayoutAtomA{}, + make_shape(shape<0>(TileShape{}), shape<2>(TileShape{}), Int{}), + cute::conditional_t< ::cutlass::gemm::detail::is_major<0,StrideA>(), Step<_2,_1,_3>, Step<_1,_2,_3>>{})); + using SmemLayoutB = decltype(tile_to_shape( + SmemLayoutAtomB{}, + make_shape(shape<1>(TileShape{}), shape<2>(TileShape{}), Int{}), + cute::conditional_t< ::cutlass::gemm::detail::is_major<0,StrideB>(), Step<_2,_1,_3>, Step<_1,_2,_3>>{})); + + static_assert(rank(SmemLayoutA{}) == 3 && size<2>(SmemLayoutA{}) == DispatchPolicy::Stages); + static_assert(rank(SmemLayoutB{}) == 3 && size<2>(SmemLayoutB{}) == DispatchPolicy::Stages); + + using PrefetchSmemLayoutA = decltype(make_layout(make_shape( + cute::Int(SmemLayoutA{})>{}, + cute::Int(SmemLayoutA{})>{}, + cute::Int{}))); + + static constexpr auto prefetch_smem_size = cute::cosize_v; + + static_assert(DispatchPolicy::Stages >= 2, "Specialization requires Stages set to value 2 or more."); + static_assert(cute::is_base_of::value && + cute::is_base_of::value, + "MMA atom must source both A and B operand from smem_desc for this mainloop."); + static_assert(cute::is_same_v || cute::is_same_v, + "GmemTiledCopy - invalid SM90 TMA copy atom specified."); + static_assert(cute::is_same_v || cute::is_same_v, + "GmemTiledCopy - invalid SM90 TMA copy atom specified."); + + // TMA converts f32 input to tf32 when copying from GMEM to SMEM + // For all other types, cast to size equivalent uint type to avoid any rounding by TMA. 
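+  // (For FP8 operands such as the ones used in this example, the cast below is a pure
+  // reinterpretation; no conversion takes place.)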
+ static constexpr bool ConvertF32toTF32A = cute::is_same_v; + static constexpr bool ConvertF32toTF32B = cute::is_same_v; + using InternalElementA = cute::conditional_t>>; + using InternalElementB = cute::conditional_t>>; + + // Defined outside the class where it's used, to work around MSVC issues + using PrefetcherPipelineStorage = ::cutlass::detail::PrefetcherPipelineSharedStorage; + + struct SharedStorage { + struct TensorStorage : cute::aligned_struct<128, _0> { + cute::array_aligned> smem_A; + cute::array_aligned> smem_B; + cute::array_aligned smem_prefetch; + } tensors; + + using PipelineStorage = typename MainloopPipeline::SharedStorage; + PipelineStorage pipeline; + PrefetcherPipelineStorage prefetcher_pipeline; + }; + using TensorStorage = typename SharedStorage::TensorStorage; + using PipelineStorage = typename SharedStorage::PipelineStorage; + + // Host side kernel arguments + struct Arguments { + ElementA const* ptr_A; + StrideA dA; + ElementB const* ptr_B; + StrideB dB; + uint32_t mma_promotion_interval = 4; + float overlap_ratio = 0.5; + float prefetch_ratio = -1.0; + }; + + // Device side kernel params + struct Params { + // Assumption: StrideA is congruent with Problem_MK + using TMA_A = decltype(make_tma_copy_A_sm90( + GmemTiledCopyA{}, + make_tensor(static_cast(nullptr), repeat_like(StrideA{}, int32_t(0)), StrideA{}), + SmemLayoutA{}(_,_,cute::Int<0>{}), + TileShape{}, + ClusterShape{})); + // Assumption: StrideB is congruent with Problem_NK + using TMA_B = decltype(make_tma_copy_B_sm90( + GmemTiledCopyB{}, + make_tensor(static_cast(nullptr), repeat_like(StrideB{}, int32_t(0)), StrideB{}), + SmemLayoutB{}(_,_,cute::Int<0>{}), + TileShape{}, + ClusterShape{})); + + TMA_A tma_load_a; + TMA_B tma_load_b; + uint32_t tma_transaction_bytes = TmaTransactionBytesMK + TmaTransactionBytesNK; + uint32_t tma_transaction_bytes_mk = TmaTransactionBytesMK; + uint32_t tma_transaction_bytes_nk = TmaTransactionBytesNK; + float overlap_ratio = 0.5; + float prefetch_ratio = -1.0; + }; + + // + // Methods + // + + template + static constexpr Params + to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { + (void) workspace; + + // Optionally append 1s until problem shape is rank-4 (MNKL), in case it is only rank-3 (MNK) + auto problem_shape_MNKL = append<4>(problem_shape, 1); + auto [M,N,K,L] = problem_shape_MNKL; + + auto ptr_A = reinterpret_cast(args.ptr_A); + auto ptr_B = reinterpret_cast(args.ptr_B); + + Tensor tensor_a = make_tensor(ptr_A, make_layout(make_shape(M,K,L), args.dA)); + Tensor tensor_b = make_tensor(ptr_B, make_layout(make_shape(N,K,L), args.dB)); + + typename Params::TMA_A tma_load_a = make_tma_copy_A_sm90( + GmemTiledCopyA{}, + tensor_a, + SmemLayoutA{}(_,_,cute::Int<0>{}), + TileShape{}, + ClusterShape{}); + typename Params::TMA_B tma_load_b = make_tma_copy_B_sm90( + GmemTiledCopyB{}, + tensor_b, + SmemLayoutB{}(_,_,cute::Int<0>{}), + TileShape{}, + ClusterShape{}); + uint32_t transaction_bytes_mk = TmaTransactionBytesMK; + uint32_t transaction_bytes_nk = TmaTransactionBytesNK; + uint32_t transaction_bytes = transaction_bytes_mk + transaction_bytes_nk; + + return { + tma_load_a, + tma_load_b, + transaction_bytes, + transaction_bytes_mk, + transaction_bytes_nk, + args.overlap_ratio, + args.prefetch_ratio + }; + } + + template + static bool + can_implement( + ProblemShape const& problem_shape, + [[maybe_unused]] Arguments const& args) { + constexpr int tma_alignment_bits = 128; + auto problem_shape_MNKL = 
append<4>(problem_shape, 1); + auto [M,N,K,L] = problem_shape_MNKL; + + constexpr int min_tma_aligned_elements_A = tma_alignment_bits / cutlass::sizeof_bits::value; + bool implementable = cutlass::detail::check_alignment(cute::make_shape(M,K,L), StrideA{}); + constexpr int min_tma_aligned_elements_B = tma_alignment_bits / cutlass::sizeof_bits::value; + implementable = implementable && cutlass::detail::check_alignment(cute::make_shape(N,K,L), StrideB{}); + + if (!implementable) { + CUTLASS_TRACE_HOST(" CAN IMPLEMENT: Problem Size doesn't meet the minimum alignment requirements for TMA.\n"); + return false; + } + + if (args.overlap_ratio > 1.0) { + CUTLASS_TRACE_HOST(" CAN IMPLEMENT: `overlap_ratio` must be either negative (disabled) or in [0, 1].\n"); + return false; + } + + if (args.prefetch_ratio > 1.0) { + CUTLASS_TRACE_HOST(" CAN IMPLEMENT: `prefetch_ratio` must be either negative (disabled) or in [0, 1].\n"); + return false; + } + + return true; + } + + static constexpr int K_PIPE_MAX = DispatchPolicy::Stages; + static constexpr int K_PIPE_MMAS = 1; + static constexpr uint32_t TmaTransactionBytesMK = + cutlass::bits_to_bytes(size<0>(SmemLayoutA{}) * size<1>(SmemLayoutA{}) * static_cast(sizeof_bits::value)); + static constexpr uint32_t TmaTransactionBytesNK = + cutlass::bits_to_bytes(size<0>(SmemLayoutB{}) * size<1>(SmemLayoutB{}) * static_cast(sizeof_bits::value)); + + /// Issue Tma Descriptor Prefetch -- ideally from a single thread for best performance + CUTLASS_DEVICE + static void prefetch_tma_descriptors(Params const& mainloop_params) { + cute::prefetch_tma_descriptor(mainloop_params.tma_load_a.get_tma_descriptor()); + cute::prefetch_tma_descriptor(mainloop_params.tma_load_b.get_tma_descriptor()); + } + + /// Set up the data needed by this collective for load and mma. + /// Returns a tuple of tensors. The collective and the kernel layer have the contract + /// Returned tuple must contain at least two elements, with the first two elements being: + /// gA_mkl - The tma tensor, A after a local tile so it has shape (BLK_M,BLK_K,m,k,l) + /// gB_nkl - The tma tensor, B after a local tile so it has shape (BLK_N,BLK_K,n,k,l) + /// The rest of the tensors can be specified as needed by this collective. 
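+  /// (This collective returns exactly the two required tensors and nothing else.)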
+ template + CUTLASS_DEVICE auto + load_init(ProblemShape_MNKL const& problem_shape_MNKL, Params const& mainloop_params) const { + using X = Underscore; + // Separate out problem shape for convenience + auto [M,N,K,L] = problem_shape_MNKL; + + // TMA requires special handling of strides to deal with coord codomain mapping + // Represent the full tensors -- get these from TMA + Tensor mA_mkl = mainloop_params.tma_load_a.get_tma_tensor(make_shape(M,K,L)); // (m,k,l) + Tensor mB_nkl = mainloop_params.tma_load_b.get_tma_tensor(make_shape(N,K,L)); // (n,k,l) + + // Make tiled views, defer the slice + Tensor gA_mkl = local_tile(mA_mkl, TileShape{}, make_coord(_,_,_), Step<_1, X,_1>{}); // (BLK_M,BLK_K,m,k,l) + Tensor gB_nkl = local_tile(mB_nkl, TileShape{}, make_coord(_,_,_), Step< X,_1,_1>{}); // (BLK_N,BLK_K,n,k,l) + + return cute::make_tuple(gA_mkl, gB_nkl); + } + + template < + class TensorA, class TensorB, + class KTileIterator, class BlockCoord + > + CUTLASS_DEVICE void + load( + Params const& mainloop_params, + MainloopPipeline pipeline, + PrefetcherPipeline prefetcher_pipeline, + PipelineState smem_pipe_write, + TensorA const& gA_mkl, + TensorB const& gB_nkl, + BlockCoord const& blk_coord, + KTileIterator k_tile_iter, int k_tile_count, + int thread_idx, + uint32_t block_rank_in_cluster, + TensorStorage& shared_tensors) { + int lane_predicate = cute::elect_one_sync(); + + if (lane_predicate) { + bool disable_gdc = mainloop_params.overlap_ratio < 0.0; + float overlap_ratio = mainloop_params.overlap_ratio; + int launch_dep_grids_threshold = static_cast(static_cast(k_tile_count - 1) * overlap_ratio); + + Tensor sA = make_tensor(make_smem_ptr(shared_tensors.smem_A.data()), SmemLayoutA{}); // (BLK_M,BLK_K,PIPE) + Tensor sB = make_tensor(make_smem_ptr(shared_tensors.smem_B.data()), SmemLayoutB{}); // (BLK_N,BLK_K,PIPE) + + // + // Prepare the TMA loads for A + // + + constexpr uint32_t cluster_shape_x = get<0>(typename DispatchPolicy::ClusterShape()); + uint2 cluster_local_block_id = {block_rank_in_cluster % cluster_shape_x, block_rank_in_cluster / cluster_shape_x}; + + auto cta_tma_a = mainloop_params.tma_load_a.get_slice(cluster_local_block_id.y); + auto cta_tma_b = mainloop_params.tma_load_b.get_slice(cluster_local_block_id.x); + + // Partition the inputs based on the current block coordinates. + auto [m_coord, n_coord, k_coord, l_coord] = blk_coord; + Tensor gA = gA_mkl(_,_,m_coord,_,l_coord); // (BLK_M,BLK_K,k) + Tensor gB = gB_nkl(_,_,n_coord,_,l_coord); // (BLK_N,BLK_K,k) + + // Applies the mapping from cta_tma_a + Tensor tAgA = cta_tma_a.partition_S(gA); // (TMA,TMA_M,TMA_K,k) + Tensor tAsA = cta_tma_a.partition_D(sA); // (TMA,TMA_M,TMA_K,PIPE) + + // Applies the mapping from cta_tma_b + Tensor tBgB = cta_tma_b.partition_S(gB); // (TMA,TMA_N,TMA_K,k) + Tensor tBsB = cta_tma_b.partition_D(sB); // (TMA,TMA_N,TMA_K,PIPE) + + uint16_t mcast_mask_a = 0; + uint16_t mcast_mask_b = 0; + + // Issue TmaLoads + // Maps the tile -> block, value + if constexpr (cute::is_same_v) { + auto block_layout = Layout{}; // (m,n) -> block_id + for (int n = 0; n < size<1>(block_layout); ++n) { + mcast_mask_a |= (uint16_t(1) << block_layout(cluster_local_block_id.x,n,Int<0>{})); + } + } + + if constexpr (cute::is_same_v) { + auto block_layout = Layout{}; // (m,n) -> block_id + for (int m = 0; m < size<0>(block_layout); ++m) { + mcast_mask_b |= (uint16_t(1) << block_layout(m,cluster_local_block_id.y,Int<0>{})); + } + } + + // We have to wait on dependent grids because of B. 
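+      // (`B` holds activations produced by the preceding kernel, so its loads must not begin
+      // until that kernel's writes are visible. `A` holds static weights with no such
+      // dependency, which is what the split-DMA variant exploits via load_MK/load_NK.)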
+ cutlass::arch::wait_on_dependent_grids(); + + // Signal prefetcher to stop + prefetcher_pipeline.producer_arrive(); + + bool launch_dep_grids = false; + // Mainloop + CUTLASS_PRAGMA_NO_UNROLL + for (int cnt=0 ; k_tile_count > 0; --k_tile_count, ++cnt) { + // LOCK smem_pipe_write for _writing_ + pipeline.producer_acquire(smem_pipe_write); + + // + // Copy gmem to smem for *k_tile_iter + // + + using BarrierType = typename MainloopPipeline::ProducerBarrierType; + BarrierType* tma_barrier = pipeline.producer_get_barrier(smem_pipe_write); + + int write_stage = smem_pipe_write.index(); + copy(mainloop_params.tma_load_a.with(*tma_barrier, mcast_mask_a, cute::TMA::CacheHintSm90::EVICT_FIRST), tAgA(_,_,_,*k_tile_iter), tAsA(_,_,_,write_stage)); + copy(mainloop_params.tma_load_b.with(*tma_barrier, mcast_mask_b, cute::TMA::CacheHintSm90::EVICT_LAST), tBgB(_,_,_,*k_tile_iter), tBsB(_,_,_,write_stage)); + ++k_tile_iter; + + if (!disable_gdc && cnt >= launch_dep_grids_threshold && !launch_dep_grids) { + launch_dep_grids = true; + cutlass::arch::launch_dependent_grids(); + } + + // Advance smem_pipe_write + ++smem_pipe_write; + } + if (!disable_gdc && !launch_dep_grids) { + cutlass::arch::launch_dependent_grids(); + } + } + } + + template < + class TensorA, + class KTileIterator, class BlockCoord + > + CUTLASS_DEVICE void + load_MK( + Params const& mainloop_params, + MainloopPipeline pipeline, + PrefetcherPipeline prefetcher_pipeline, + PipelineState smem_pipe_write, + TensorA const& gA_mkl, + BlockCoord const& blk_coord, + KTileIterator k_tile_iter, int k_tile_count, + int thread_idx, + uint32_t block_rank_in_cluster, + TensorStorage& shared_tensors) { + int lane_predicate = cute::elect_one_sync(); + + if (lane_predicate) { + bool disable_gdc = mainloop_params.overlap_ratio < 0.0; + float overlap_ratio = mainloop_params.overlap_ratio; + int launch_dep_grids_threshold = static_cast(static_cast(k_tile_count - 1) * overlap_ratio); + + Tensor sA = make_tensor(make_smem_ptr(shared_tensors.smem_A.data()), SmemLayoutA{}); // (BLK_M,BLK_K,PIPE) + + // + // Prepare the TMA loads for A + // + + constexpr uint32_t cluster_shape_x = get<0>(typename DispatchPolicy::ClusterShape()); + uint2 cluster_local_block_id = {block_rank_in_cluster % cluster_shape_x, block_rank_in_cluster / cluster_shape_x}; + + auto cta_tma_a = mainloop_params.tma_load_a.get_slice(cluster_local_block_id.y); + + // Partition the inputs based on the current block coordinates. + auto [m_coord, n_coord, k_coord, l_coord] = blk_coord; + Tensor gA = gA_mkl(_,_,m_coord,_,l_coord); // (BLK_M,BLK_K,k) + + // Applies the mapping from cta_tma_a + Tensor tAgA = cta_tma_a.partition_S(gA); // (TMA,TMA_M,TMA_K,k) + Tensor tAsA = cta_tma_a.partition_D(sA); // (TMA,TMA_M,TMA_K,PIPE) + + uint16_t mcast_mask_a = 0; + + // Issue TmaLoads + // Maps the tile -> block, value + if constexpr (cute::is_same_v) { + auto block_layout = Layout{}; // (m,n) -> block_id + for (int n = 0; n < size<1>(block_layout); ++n) { + mcast_mask_a |= (uint16_t(1) << block_layout(cluster_local_block_id.x,n,Int<0>{})); + } + } + + // Don't wait on dependent grids when loading `A`, because + // we assume `A` (weights) are static. 
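+      // A worked example of the threshold above (illustrative numbers, not from this
+      // change): with k_tile_count = 32 and overlap_ratio = 0.5,
+      // launch_dep_grids_threshold = int((32 - 1) * 0.5) = 15, so the dependent grid
+      // is launched once the load for the 16th of the 32 K tiles has been issued.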
+ + bool launch_dep_grids = false; + // Mainloop + CUTLASS_PRAGMA_NO_UNROLL + for (int cnt=0 ; k_tile_count > 0; --k_tile_count, ++cnt) { + // LOCK smem_pipe_write for _writing_ + pipeline.producer_acquire(smem_pipe_write); + + // + // Copy gmem to smem for *k_tile_iter + // + + using BarrierType = typename MainloopPipeline::ProducerBarrierType; + BarrierType* tma_barrier = pipeline.producer_get_barrier(smem_pipe_write); + + int write_stage = smem_pipe_write.index(); + copy(mainloop_params.tma_load_a.with(*tma_barrier, mcast_mask_a, cute::TMA::CacheHintSm90::EVICT_FIRST), tAgA(_,_,_,*k_tile_iter), tAsA(_,_,_,write_stage)); + ++k_tile_iter; + + if (!disable_gdc && cnt >= launch_dep_grids_threshold && !launch_dep_grids) { + launch_dep_grids = true; + cutlass::arch::launch_dependent_grids(); + } + + // Advance smem_pipe_write + ++smem_pipe_write; + } + if (!disable_gdc && !launch_dep_grids) { + cutlass::arch::launch_dependent_grids(); + } + } + } + + template < + class TensorB, + class KTileIterator, class BlockCoord + > + CUTLASS_DEVICE void + load_NK( + Params const& mainloop_params, + MainloopPipeline pipeline, + PrefetcherPipeline prefetcher_pipeline, + PipelineState smem_pipe_write, + TensorB const& gB_nkl, + BlockCoord const& blk_coord, + KTileIterator k_tile_iter, int k_tile_count, + int thread_idx, + uint32_t block_rank_in_cluster, + TensorStorage& shared_tensors) { + int lane_predicate = cute::elect_one_sync(); + + if (lane_predicate) { + Tensor sB = make_tensor(make_smem_ptr(shared_tensors.smem_B.data()), SmemLayoutB{}); // (BLK_N,BLK_K,PIPE) + + // + // Prepare the TMA loads for B + // + + constexpr uint32_t cluster_shape_x = get<0>(typename DispatchPolicy::ClusterShape()); + uint2 cluster_local_block_id = {block_rank_in_cluster % cluster_shape_x, block_rank_in_cluster / cluster_shape_x}; + + auto cta_tma_b = mainloop_params.tma_load_b.get_slice(cluster_local_block_id.x); + + // Partition the inputs based on the current block coordinates. 
+ auto [m_coord, n_coord, k_coord, l_coord] = blk_coord; + Tensor gB = gB_nkl(_,_,n_coord,_,l_coord); // (BLK_N,BLK_K,k) + + // Applies the mapping from cta_tma_b + Tensor tBgB = cta_tma_b.partition_S(gB); // (TMA,TMA_N,TMA_K,k) + Tensor tBsB = cta_tma_b.partition_D(sB); // (TMA,TMA_N,TMA_K,PIPE) + + uint16_t mcast_mask_b = 0; + + // Issue TmaLoads + // Maps the tile -> block, value + if constexpr (cute::is_same_v) { + auto block_layout = Layout{}; // (m,n) -> block_id + for (int m = 0; m < size<0>(block_layout); ++m) { + mcast_mask_b |= (uint16_t(1) << block_layout(m,cluster_local_block_id.y,Int<0>{})); + } + } + + // Ensure that the prefetched kernel does not touch + // unflushed global memory prior to this instruction + cutlass::arch::wait_on_dependent_grids(); + + // Signal prefetcher to stop + prefetcher_pipeline.producer_arrive(); + + // Mainloop + CUTLASS_PRAGMA_NO_UNROLL + for (; k_tile_count > 0; --k_tile_count) { + // LOCK smem_pipe_write for _writing_ + pipeline.producer_acquire(smem_pipe_write); + + // + // Copy gmem to smem for *k_tile_iter + // + + using BarrierType = typename MainloopPipeline::ProducerBarrierType; + BarrierType* tma_barrier = pipeline.producer_get_barrier(smem_pipe_write); + + int write_stage = smem_pipe_write.index(); + copy(mainloop_params.tma_load_b.with(*tma_barrier, mcast_mask_b, cute::TMA::CacheHintSm90::EVICT_LAST), tBgB(_,_,_,*k_tile_iter), tBsB(_,_,_,write_stage)); + ++k_tile_iter; + + // Advance smem_pipe_write + ++smem_pipe_write; + } + } + } + + /// Perform a Producer Epilogue to prevent early exit of blocks in a Cluster + CUTLASS_DEVICE void + load_tail(MainloopPipeline pipeline, PipelineState smem_pipe_write) { + int lane_predicate = cute::elect_one_sync(); + + // Issue the epilogue waits + if (lane_predicate) { + /* This helps avoid early exit of blocks in Cluster + * Waits for all stages to either be released (all + * Consumer UNLOCKs), or if the stage was never used + * then would just be acquired since the phase was + * still inverted from make_producer_start_state + */ + pipeline.producer_tail(smem_pipe_write); + } + } + + + template < + class TensorA, + class KTileIterator, class BlockCoord + > + CUTLASS_DEVICE void + prefetch_MK( + Params const& mainloop_params, + PrefetcherPipeline prefetcher_pipeline, + PipelineState smem_pipe_write, + TensorA const& gA_mkl, + BlockCoord const& blk_coord, + KTileIterator k_tile_iter, int k_tile_count, + int thread_idx, + uint32_t block_rank_in_cluster, + TensorStorage& shared_tensors) { + int lane_predicate = cute::elect_one_sync(); + + if (lane_predicate) { + bool do_best_effort_prefetch = mainloop_params.prefetch_ratio < 0; + float prefetch_ratio = do_best_effort_prefetch ? 1.0 : mainloop_params.prefetch_ratio; + int prefetch_iters = static_cast(static_cast(k_tile_count) * 0.5 * prefetch_ratio); + prefetch_iters = min(k_tile_count, ((prefetch_iters + PrefetchStages - 1) / PrefetchStages) * PrefetchStages); + + Tensor sA = make_tensor( + make_smem_ptr(shared_tensors.smem_prefetch.data()), PrefetchSmemLayoutA{}); // (BLK_M,BLK_K,PIPE) + + // + // Prepare the TMA loads for A + // + + constexpr uint32_t cluster_shape_x = get<0>(typename DispatchPolicy::ClusterShape()); + uint2 cluster_local_block_id = {block_rank_in_cluster % cluster_shape_x, block_rank_in_cluster / cluster_shape_x}; + + auto cta_tma_a = mainloop_params.tma_load_a.get_slice(cluster_local_block_id.y); + + // Partition the inputs based on the current block coordinates. 
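+      // Note: this path only warms the L2 cache. Every prefetch TMA load below lands
+      // in a single-stage scratch smem buffer (write_stage is pinned to 0) and is
+      // never consumed from smem; the payoff is that the mainloop's later loads of A
+      // hit in L2.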
+ auto [m_coord, n_coord, k_coord, l_coord] = blk_coord; + Tensor gA = gA_mkl(_,_,m_coord,_,l_coord); // (BLK_M,BLK_K,k) + + // Applies the mapping from cta_tma_a + Tensor tAgA = cta_tma_a.partition_S(gA); // (TMA,TMA_M,TMA_K,k) + Tensor tAsA = cta_tma_a.partition_D(sA); // (TMA,TMA_M,TMA_K,PIPE) + + uint16_t mcast_mask_a = 0; + + // Issue TmaLoads + // Maps the tile -> block, value + if constexpr (cute::is_same_v) { + auto block_layout = Layout{}; // (m,n) -> block_id + for (int n = 0; n < size<1>(block_layout); ++n) { + mcast_mask_a |= (uint16_t(1) << block_layout(cluster_local_block_id.x,n,Int<0>{})); + } + } + + uint32_t prefetcher_stage = 0; + uint32_t prefetcher_phase = 0; + CUTLASS_PRAGMA_NO_UNROLL + for (int cnt = 0 ; cnt < prefetch_iters; ++cnt) { + + if (do_best_effort_prefetch && prefetcher_pipeline.have_producers_arrived()) { + break; + } + + prefetcher_pipeline.prefetcher_acquire(prefetcher_stage, prefetcher_phase, cnt >= PrefetchStages); + using BarrierType = typename PrefetcherPipeline::PrefetcherBarrierType; + BarrierType* tma_barrier = prefetcher_pipeline.prefetcher_get_barrier(prefetcher_stage); + + int write_stage = 0; + copy(mainloop_params.tma_load_a.with(*tma_barrier, mcast_mask_a, cute::TMA::CacheHintSm90::EVICT_FIRST), tAgA(_,_,_,*k_tile_iter), tAsA(_,_,_,write_stage)); + ++k_tile_iter; + ++k_tile_iter; + + prefetcher_pipeline.advance_prefetcher_state(prefetcher_stage, prefetcher_phase); + } + prefetcher_pipeline.prefetcher_tail(prefetcher_stage, prefetcher_phase); + } + } + + /// Perform a collective-scoped matrix multiply-accumulate + /// Consumer Perspective + template < + class FrgTensorC + > + CUTLASS_DEVICE void + mma(MainloopPipeline pipeline, + PipelineState smem_pipe_read, + FrgTensorC& accum, + int k_tile_count, + int thread_idx, + TensorStorage& shared_tensors, + Params const& mainloop_params) { + static_assert(is_rmem::value, "C tensor must be rmem resident."); + static_assert(cute::rank(SmemLayoutA{}) == 3, "Smem layout must be rank 3."); + static_assert(cute::rank(SmemLayoutB{}) == 3, "Smem layout must be rank 3."); + static_assert(cute::is_void_v, + "SM90 GMMA mainloops cannot have a non-void copy atom for smem sourced instructions."); + static_assert(cute::is_void_v, + "SM90 GMMA mainloops cannot have a non-void copy atom for smem sourced instructions."); + + Tensor sA = make_tensor(make_smem_ptr(shared_tensors.smem_A.data()), SmemLayoutA{}); // (BLK_M,BLK_K,PIPE) + Tensor sB = make_tensor(make_smem_ptr(shared_tensors.smem_B.data()), SmemLayoutB{}); // (BLK_N,BLK_K,PIPE) + + // + // Define C accumulators and A/B partitioning + // + + TiledMma tiled_mma; + auto thread_mma = tiled_mma.get_thread_slice(thread_idx); + + Tensor tCsA = thread_mma.partition_A(sA); // (MMA,MMA_M,MMA_K,PIPE) + Tensor tCsB = thread_mma.partition_B(sB); // (MMA,MMA_N,MMA_K,PIPE) + + // Allocate "fragments/descriptors" + Tensor tCrA = thread_mma.make_fragment_A(tCsA); // (MMA,MMA_M,MMA_K,PIPE) + Tensor tCrB = thread_mma.make_fragment_B(tCsB); // (MMA,MMA_N,MMA_K,PIPE) + + CUTE_STATIC_ASSERT_V(size<1>(tCsA) == size<1>(accum)); // M + CUTE_STATIC_ASSERT_V(size<1>(tCsB) == size<2>(accum)); // N + CUTE_STATIC_ASSERT_V(size<2>(tCsA) == size<2>(tCsB)); // K + CUTE_STATIC_ASSERT_V(size<3>(tCsA) == size<3>(tCsB)); // PIPE + CUTE_STATIC_ASSERT_V(Int{} == size<2>(sA)); // PIPE + CUTE_STATIC_ASSERT_V(Int{} == size<2>(sB)); // PIPE + + // + // PIPELINED MAIN LOOP + // + static_assert((0 <= K_PIPE_MMAS) && (K_PIPE_MMAS < K_PIPE_MAX), + "ERROR : Incorrect number of MMAs in flight"); + + // We 
release buffers to producer warps (DMA load) with some MMAs still in flight
+    PipelineState smem_pipe_release = smem_pipe_read;
+
+    // Prologue GMMAs
+    int prologue_mma_count = min(K_PIPE_MMAS, k_tile_count);
+
+    tiled_mma.accumulate_ = GMMA::ScaleOut::Zero;
+
+    warpgroup_fence_operand(accum);
+    CUTLASS_PRAGMA_UNROLL
+    for (int k_tile_prologue = prologue_mma_count; k_tile_prologue > 0; --k_tile_prologue)
+    {
+      // WAIT on smem_pipe_read until its data are available (phase bit flips from rdPhaseBit value)
+      auto barrier_token = pipeline.consumer_try_wait(smem_pipe_read);
+      pipeline.consumer_wait(smem_pipe_read, barrier_token);
+
+      int read_stage = smem_pipe_read.index();
+      warpgroup_arrive();
+      // Unroll the K mode manually to set scale D to 1
+      CUTLASS_PRAGMA_UNROLL
+      for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
+        // (V,M,K) x (V,N,K) => (V,M,N)
+        cute::gemm(tiled_mma, tCrA(_,_,k_block,read_stage), tCrB(_,_,k_block,read_stage), accum);
+        tiled_mma.accumulate_ = GMMA::ScaleOut::One;
+      }
+
+      warpgroup_commit_batch();
+
+      ++smem_pipe_read;
+    }
+
+    warpgroup_fence_operand(accum);
+    // Mainloop GMMAs
+    k_tile_count -= prologue_mma_count;
+
+    CUTLASS_PRAGMA_NO_UNROLL
+    for ( ; k_tile_count > 0; --k_tile_count)
+    {
+      // WAIT on smem_pipe_read until its data are available (phase bit flips from rdPhaseBit value)
+      auto barrier_token = pipeline.consumer_try_wait(smem_pipe_read);
+      pipeline.consumer_wait(smem_pipe_read, barrier_token);
+
+      //
+      // Compute on k_tile
+      //
+
+      int read_stage = smem_pipe_read.index();
+      warpgroup_fence_operand(accum);
+      warpgroup_arrive();
+      // Unroll the K mode manually to set scale D to 1
+      CUTLASS_PRAGMA_UNROLL
+      for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
+        // (V,M,K) x (V,N,K) => (V,M,N)
+        cute::gemm(tiled_mma, tCrA(_,_,k_block,read_stage), tCrB(_,_,k_block,read_stage), accum);
+        tiled_mma.accumulate_ = GMMA::ScaleOut::One;
+      }
+      warpgroup_commit_batch();
+
+      /// Wait on the GMMA barrier for K_PIPE_MMAS (or fewer) outstanding to ensure smem_pipe_write is consumed
+      warpgroup_wait<K_PIPE_MMAS>();
+      warpgroup_fence_operand(accum);
+
+      // UNLOCK smem_pipe_release, done _computing_ on it
+      pipeline.consumer_release(smem_pipe_release);
+
+      // Advance smem_pipe_read and smem_pipe_release
+      ++smem_pipe_read;
+      ++smem_pipe_release;
+    }
+
+    warpgroup_fence_operand(accum);
+  }
+
+  /// Perform a Consumer Epilogue to release all buffers
+  CUTLASS_DEVICE void
+  mma_tail(MainloopPipeline pipeline, PipelineState smem_pipe_release, int k_tile_count) {
+    // Prologue GMMAs
+    int prologue_mma_count = min(K_PIPE_MMAS, k_tile_count);
+    k_tile_count -= prologue_mma_count;
+
+    smem_pipe_release.advance(k_tile_count);
+
+    // Wait on all GMMAs to complete
+    warpgroup_wait<0>();
+
+    for (int count = 0; count < prologue_mma_count; ++count) {
+      pipeline.consumer_release(smem_pipe_release);  // UNLOCK smem_pipe_release, done _computing_ on it
+      ++smem_pipe_release;
+    }
+  }
+};
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+} // namespace cutlass::gemm::collective
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/examples/63_hopper_gemm_with_weight_prefetch/gemm_with_weight_prefetch_commandline.hpp b/examples/63_hopper_gemm_with_weight_prefetch/gemm_with_weight_prefetch_commandline.hpp
new file mode 100644
index 0000000000..6be87768ee
--- /dev/null
+++ b/examples/63_hopper_gemm_with_weight_prefetch/gemm_with_weight_prefetch_commandline.hpp
@@ -0,0 +1,117 @@
+/*************************************************************************************************** + * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +// Command line options parsing +struct Options { + + bool help = false; + + float alpha = 1.f, beta = 0.f; + float overlap_ratio = 0.5f, prefetch_ratio = 0.5f; + int iterations = 1000; + int n = 64, m = 1280, k = 8192, l = 1; + + // Parses the command line + void parse(int argc, char const **args) { + cutlass::CommandLine cmd(argc, args); + + if (cmd.check_cmd_line_flag("help")) { + help = true; + return; + } + + cmd.get_cmd_line_argument("m", m); + cmd.get_cmd_line_argument("n", n); + cmd.get_cmd_line_argument("k", k); + cmd.get_cmd_line_argument("l", l); + cmd.get_cmd_line_argument("alpha", alpha, 1.f); + cmd.get_cmd_line_argument("beta", beta, 0.f); + cmd.get_cmd_line_argument("p", prefetch_ratio, 0.5f); + cmd.get_cmd_line_argument("o", overlap_ratio, 0.5f); + cmd.get_cmd_line_argument("iterations", iterations); + } + + /// Prints the usage statement. + std::ostream & print_usage(std::ostream &out) const { + + out << "63_hopper_gemm_with_weight_prefetch\n\n" + << " Hopper FP8 GEMM using a non-persistent kernel with L2 weight prefetch. 
\n" + << " For more details please refer to the source file.\n\n" + << "Options:\n\n" + << " --help If specified, displays this usage statement\n\n" + << " --m= Sets the M extent of the GEMM\n" + << " --n= Sets the N extent of the GEMM\n" + << " --k= Sets the K extent of the GEMM\n" + << " --l= Sets the l extent (batch) of the GEMM\n" + << " --alpha= Epilogue scalar alpha\n" + << " --beta= Epilogue scalar beta\n" + << " --p= Prefetch ratio\n" + << " --o= Overlap ratio\n" + << " --iterations= Number of profiling iterations to perform.\n\n"; + + out + << "\n\nExamples:\n\n" + << "$ " << "63_hopper_gemm_with_weight_prefetch" << + " --m=1024 --n=512 --k=1024 --o=0.5 --p=0.5 \n\n"; + + return out; + } + + /// Compute performance in GFLOP/s + double gflops(double runtime_s) const + { + // Two flops per multiply-add + uint64_t flop = uint64_t(2) * m * n * k * l; + double gflop = double(flop) / double(1.0e9); + return gflop / runtime_s; + } + + /// Compute effective bandwidth in GB/sec + double effective_bandwidth( + double runtime_s, + size_t bytes_a, + size_t bytes_b, + size_t bytes_c, + size_t bytes_d + ) const + { + static double const kBytesPerGiB = double(1ull << 30); + + double bytes_in = + (double)(l) * (double)(m) * (double)(k) * (double)(bytes_a) + // A + (double)(l) * (double)(n) * (double)(k) * (double)(bytes_b) + // B + (beta != 0.f ? (double)(l) * (double)(m) * (double)(n) * (double)(bytes_c) : 0.f); // C + double bytes_out = (double)(l) * (double)(m) * (double)(n) * (double)(bytes_d); // D + + double gb_total = (bytes_in + bytes_out) / kBytesPerGiB; + return gb_total / runtime_s; + } +}; diff --git a/examples/63_hopper_gemm_with_weight_prefetch/kernel/sm90_gemm_tma_warpspecialized_with_prefetch.hpp b/examples/63_hopper_gemm_with_weight_prefetch/kernel/sm90_gemm_tma_warpspecialized_with_prefetch.hpp new file mode 100644 index 0000000000..6e33d8fc62 --- /dev/null +++ b/examples/63_hopper_gemm_with_weight_prefetch/kernel/sm90_gemm_tma_warpspecialized_with_prefetch.hpp @@ -0,0 +1,561 @@ +/*************************************************************************************************** + * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +#pragma once + +#include "cutlass/cutlass.h" +#include "cutlass/fast_math.h" +#include "cutlass/kernel_hardware_info.hpp" +#include "cute/arch/cluster_sm90.hpp" +#include "cutlass/arch/reg_reconfig.h" +#include "cutlass/arch/mma_sm90.h" +#include "cutlass/epilogue/collective/detail.hpp" +#include "cutlass/gemm/gemm.h" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/gemm/kernel/sm90_tile_scheduler.hpp" +#include "cutlass/pipeline/pipeline.hpp" +#include "cutlass/trace.h" + +#include "cute/tensor.hpp" + +#include "../collective/dispatch_policy_extra.hpp" + +/////////////////////////////////////////////////////////////////////////////// + +namespace cutlass::gemm::kernel { + +/////////////////////////////////////////////////////////////////////////////// + +// GEMM + Prefetch for the A tensor + (optional) split DMA warps +template < + class ProblemShape_, + class CollectiveMainloop_, + class CollectiveEpilogue_, + class TileScheduler_ +> +class GemmUniversal< + ProblemShape_, + CollectiveMainloop_, + CollectiveEpilogue_, + TileScheduler_, + cute::enable_if_t< + cute::is_same_v || + cute::is_same_v + > +> +{ +public: + // + // Type Aliases + // + using ProblemShape = ProblemShape_; + static_assert(cute::rank(ProblemShape{}) == 3 or cute::rank(ProblemShape{}) == 4, + "ProblemShape{} should be or "); + static constexpr bool IsGdcEnabled = cutlass::arch::IsGdcGloballyEnabled; + + static constexpr bool SplitWarps = cute::is_same_v; + + // Mainloop derived types + using CollectiveMainloop = CollectiveMainloop_; + using TileShape = typename CollectiveMainloop::TileShape; + using TiledMma = typename CollectiveMainloop::TiledMma; + using ArchTag = typename CollectiveMainloop::ArchTag; + using ElementA = typename CollectiveMainloop::ElementA; + using StrideA = typename CollectiveMainloop::StrideA; + using ElementB = typename CollectiveMainloop::ElementB; + using StrideB = typename CollectiveMainloop::StrideB; + using DispatchPolicy = typename CollectiveMainloop::DispatchPolicy; + using ElementAccumulator = typename CollectiveMainloop::ElementAccumulator; + using ClusterShape = typename DispatchPolicy::ClusterShape; + using MainloopArguments = typename CollectiveMainloop::Arguments; + using MainloopParams = typename CollectiveMainloop::Params; + static_assert(ArchTag::kMinComputeCapability >= 90); + + // Epilogue derived types + using CollectiveEpilogue = CollectiveEpilogue_; + using ElementC = typename CollectiveEpilogue::ElementC; + using StrideC = typename CollectiveEpilogue::StrideC; + using ElementD = typename CollectiveEpilogue::ElementD; + using StrideD = typename CollectiveEpilogue::StrideD; + using EpilogueArguments = typename CollectiveEpilogue::Arguments; + using EpilogueParams = typename CollectiveEpilogue::Params; + + static_assert(cute::is_void_v or cute::is_same_v, + "TMA warp-specialized kernel does not support specializing the tile 
scheduler."); + using TileSchedulerTag = TileScheduler_; + using TileScheduler = typename detail::TileSchedulerSelector< + TileScheduler_, ArchTag, TileShape, ClusterShape>::Scheduler; + using TileSchedulerArguments = typename TileScheduler::Arguments; + + // Kernel level shared memory storage + struct SharedStorage { + // Mainloop and epilogue don't use smem concurrently since kernel is non-persistent, so we can use a union + union TensorStorage { + using MainloopTensorStorage = typename CollectiveMainloop::TensorStorage; + using EpilogueTensorStorage = typename CollectiveEpilogue::TensorStorage; + + MainloopTensorStorage mainloop; + EpilogueTensorStorage epilogue; + } tensors; + + struct PipelineStorage : cute::aligned_struct<16, _1> { + using MainloopPipelineStorage = typename CollectiveMainloop::PipelineStorage; + using PrefetcherPipelineStorage = typename CollectiveMainloop::PrefetcherPipelineStorage; + using EpiLoadPipelineStorage = typename CollectiveEpilogue::PipelineStorage; + + alignas(16) MainloopPipelineStorage mainloop; + alignas(16) EpiLoadPipelineStorage epi_load; + alignas(16) PrefetcherPipelineStorage prefetcher; + } pipelines; + }; + + static constexpr int SharedStorageSize = sizeof(SharedStorage); + + static constexpr uint32_t NumLoadWarpGroups = 1; + static constexpr uint32_t NumMmaWarpGroups = 1; + static constexpr uint32_t MaxThreadsPerBlock = CUTE_STATIC_V(size(TiledMma{})) + (NumLoadWarpGroups * NumThreadsPerWarpGroup); + static constexpr uint32_t MinBlocksPerMultiprocessor = 1; + + // Device side arguments + struct Arguments { + GemmUniversalMode mode{}; + ProblemShape problem_shape{}; + MainloopArguments mainloop{}; + EpilogueArguments epilogue{}; + KernelHardwareInfo hw_info{}; + TileSchedulerArguments scheduler{}; + }; + + // Kernel entry point API + struct Params { + GemmUniversalMode mode{}; + ProblemShape problem_shape{}; + MainloopParams mainloop{}; + EpilogueParams epilogue{}; + }; + + // + // Methods + // + + // Convert to underlying arguments. In this case, a simple copy for the aliased type. 
+ static + Params + to_underlying_arguments(Arguments const& args, void* workspace) { + (void) workspace; + auto problem_shape = args.problem_shape; + if constexpr (detail::Has_SwapAB_v) { + // swap M/N + get<0>(problem_shape) = get<1>(args.problem_shape); + get<1>(problem_shape) = get<0>(args.problem_shape); + } + return { + args.mode, + problem_shape, + CollectiveMainloop::to_underlying_arguments(args.problem_shape, args.mainloop, workspace), + CollectiveEpilogue::to_underlying_arguments(args.problem_shape, args.epilogue, workspace) + }; + } + + static bool + can_implement(Arguments const& args) { + bool implementable = (args.mode == GemmUniversalMode::kGemm) or + (args.mode == GemmUniversalMode::kBatched && cute::rank(ProblemShape{}) == 4); + if (!implementable) { + CUTLASS_TRACE_HOST(" CAN IMPLEMENT: Arguments or Problem Shape don't meet the requirements.\n"); + return implementable; + } + implementable &= CollectiveMainloop::can_implement(args.problem_shape, args.mainloop); + implementable &= CollectiveEpilogue::can_implement(args.problem_shape, args.epilogue); + implementable &= TileScheduler::can_implement(args.scheduler); + + return implementable; + } + + static + size_t + get_workspace_size(Arguments const& args) { + return 0; + } + + static + cutlass::Status + initialize_workspace(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr, + CudaHostAdapter* cuda_adapter = nullptr) { + return Status::kSuccess; + } + + // Computes the kernel launch grid shape based on runtime parameters + static dim3 + get_grid_shape(Params const& params) { + auto cluster_shape = ClusterShape{}; + auto tile_shape = TileShape{}; + auto problem_shape_MNKL = append<4>(params.problem_shape, Int<1>{}); + return TileScheduler::get_tiled_cta_shape_mnl( + problem_shape_MNKL, tile_shape, cluster_shape); + } + + static dim3 + get_block_shape() { + return dim3(MaxThreadsPerBlock, 1, 1); + } + + CUTLASS_DEVICE + void + operator()(Params const& params, char* smem_buf) { + using namespace cute; + using X = Underscore; + +#if defined(__CUDA_ARCH_FEAT_SM90_ALL) +# define ENABLE_SM90_KERNEL_LEVEL 1 +#endif + +// Any Tensor Op MMA Atom in the WGMMA ISA is arch conditional to sm90a. +#if ! defined(ENABLE_SM90_KERNEL_LEVEL) + printf("ERROR : Arch conditional MMA instruction used without targeting sm90a compute capability. Aborting.\n"); +#else + + enum class WarpGroupRole { + Producer = 0, + Consumer = 1, + }; + // Split mode: use Warp0 to load NK and epilogue, Warp2 to load MK. + // Non-split mode: use Warp0 to load MK, NK and epilogue, Warp2 is unused. + // Both modes use Warp1 to prefetch. 
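+    // In table form (a sketch of the producer warp group; Warp3 is never used):
+    //
+    //   warp    split mode (SplitWarps)        non-split mode
+    //   -----   ----------------------------   ----------------------------
+    //   Warp0   load B (NK) + epilogue load    load A and B + epilogue load
+    //   Warp1   prefetch A (MK) into L2        prefetch A (MK) into L2
+    //   Warp2   load A (MK)                    unused
+    //   Warp3   unused                         unused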
+ enum class ProducerWarpRole { + Warp0 = 0, + PrefetchMK = 1, + Warp2 = 2, + UnusedWarp = 3 + }; + + // Kernel level shared memory storage + SharedStorage& shared_storage = *reinterpret_cast(smem_buf); + + int thread_idx = int(threadIdx.x); + int lane_idx = canonical_lane_idx(); + int warp_idx = canonical_warp_idx_sync(); + int warp_idx_in_warp_group = warp_idx % NumWarpsPerWarpGroup; + int warp_group_thread_idx = thread_idx % NumThreadsPerWarpGroup; + auto warp_group_role = WarpGroupRole(canonical_warp_group_idx()); + auto producer_warp_role = ProducerWarpRole(warp_idx_in_warp_group); + int lane_predicate = cute::elect_one_sync(); + uint32_t block_rank_in_cluster = cute::block_rank_in_cluster(); + + + // Issue Tma Descriptor Prefetch from a single thread + if ((warp_idx == 0) && lane_predicate) { + CollectiveMainloop::prefetch_tma_descriptors(params.mainloop); + CollectiveEpilogue::prefetch_tma_descriptors(params.epilogue); + } + + // Mainloop Load pipeline + using MainloopPipeline = typename CollectiveMainloop::MainloopPipeline; + typename MainloopPipeline::Params mainloop_pipeline_params; + mainloop_pipeline_params.is_leader = warp_group_thread_idx == 0; + if (warp_group_role == WarpGroupRole::Producer && ( + producer_warp_role == ProducerWarpRole::Warp0 || + producer_warp_role == ProducerWarpRole::Warp2)) { + mainloop_pipeline_params.role = MainloopPipeline::ThreadCategory::Producer; + mainloop_pipeline_params.transaction_bytes = params.mainloop.tma_transaction_bytes; + } + if (warp_group_role == WarpGroupRole::Consumer) { + mainloop_pipeline_params.role = MainloopPipeline::ThreadCategory::Consumer; + } + mainloop_pipeline_params.num_consumers = NumThreadsPerWarpGroup; + MainloopPipeline mainloop_pipeline(shared_storage.pipelines.mainloop, mainloop_pipeline_params, ClusterShape{}); + bool should_prefetch = params.mainloop.prefetch_ratio > 0; + using PrefetcherPipeline = typename CollectiveMainloop::PrefetcherPipeline; + typename PrefetcherPipeline::Params prefetcher_pipeline_params; + prefetcher_pipeline_params.num_prefetchers = 1; + if (warp_group_role == WarpGroupRole::Producer && producer_warp_role == ProducerWarpRole::PrefetchMK) { + prefetcher_pipeline_params.should_prefetch = should_prefetch; + prefetcher_pipeline_params.transaction_bytes = params.mainloop.tma_transaction_bytes_mk; + } + PrefetcherPipeline prefetcher_pipeline(shared_storage.pipelines.prefetcher, prefetcher_pipeline_params); + + // Epilogue Load pipeline + using EpiLoadPipeline = typename CollectiveEpilogue::LoadPipeline; + typename EpiLoadPipeline::Params epi_load_pipeline_params; + if (warp_group_role == WarpGroupRole::Producer && producer_warp_role == ProducerWarpRole::Warp0) { + epi_load_pipeline_params.role = EpiLoadPipeline::ThreadCategory::Producer; + } + if (warp_group_role == WarpGroupRole::Consumer) { + epi_load_pipeline_params.role = EpiLoadPipeline::ThreadCategory::Consumer; + } + epi_load_pipeline_params.dst_blockid = cute::block_rank_in_cluster(); + epi_load_pipeline_params.producer_arv_count = NumThreadsPerWarp; + epi_load_pipeline_params.consumer_arv_count = NumThreadsPerWarpGroup; + if constexpr (CollectiveEpilogue::RequiresTransactionBytes) { + epi_load_pipeline_params.transaction_bytes = params.epilogue.tma_transaction_bytes; + } + EpiLoadPipeline epi_load_pipeline(shared_storage.pipelines.epi_load, epi_load_pipeline_params); + + // Epilogue Store pipeline + using EpiStorePipeline = typename CollectiveEpilogue::StorePipeline; + typename EpiStorePipeline::Params epi_store_pipeline_params; + 
epi_store_pipeline_params.always_wait = true; + EpiStorePipeline epi_store_pipeline(epi_store_pipeline_params); + + // Initialize starting pipeline states for the collectives + // Epilogue store pipe is producer-only (consumer is TMA unit, waits via scoreboarding) + typename CollectiveMainloop::PipelineState mainloop_pipe_consumer_state; + typename CollectiveEpilogue::LoadPipelineState epi_load_pipe_consumer_state; + + // For the DMA Load (producer) we start with an opposite phase + // i.e., we skip all waits since we know that the buffer is indeed empty + PipelineState mainloop_pipe_producer_state = cutlass::make_producer_start_state(); + PipelineState epi_load_pipe_producer_state = cutlass::make_producer_start_state(); + PipelineState epi_store_pipe_producer_state = cutlass::make_producer_start_state(); + + auto cluster_wait_fn = [&] () { + // We need this to guarantee that the Pipeline init is visible + // To all producers and consumer thread blocks in the Cluster + if constexpr (size(ClusterShape{}) > 1) { + // Non-prefetcher warps arrive and wait, + // Prefetcher warp can go ahead without waiting. + cute::cluster_arrive_relaxed(); + if (warp_group_role != WarpGroupRole::Producer || + producer_warp_role != ProducerWarpRole::PrefetchMK) { + cute::cluster_wait(); + } + return [] () {}; + } + else { + // __syncthreads() but only for non prefetcher warps + if (should_prefetch) { + + // Use a named barrier to let the prefetcher warp start loading into the L2 + // without waiting to sync with all other warps. + // All other warps need to sync because the mainloop pipeline init + // should be visible to all of them. + // Prefetcher has its own barriers, and the only warps it would need to sync + // with would be the DMA warps. + using ClusterSyncWithPrefetchBarrier = typename cutlass::arch::NamedBarrier; + auto prefetcher_arrive_barrier = ClusterSyncWithPrefetchBarrier( + blockDim.x * blockDim.y * blockDim.z, + /*reserved_named_barriers_*/ 14); + // Prefetcher warp doesn't arrive on this barrier. + auto cluster_arrive_barrier = ClusterSyncWithPrefetchBarrier( + blockDim.x * blockDim.y * blockDim.z - NumThreadsPerWarp, + /*reserved_named_barriers_*/ 15); + + if (warp_group_role == WarpGroupRole::Producer && producer_warp_role == ProducerWarpRole::PrefetchMK) { + __syncwarp(); + prefetcher_arrive_barrier.arrive(); + } + else if (warp_group_role == WarpGroupRole::Producer) { + prefetcher_arrive_barrier.arrive_and_wait(); + cluster_arrive_barrier.arrive_and_wait(); + } + else { + prefetcher_arrive_barrier.arrive(); + cluster_arrive_barrier.arrive_and_wait(); + } + } else { + __syncthreads(); + } + return [] () {}; + } + } (); + + // Preconditions + static_assert(cute::rank(StrideA{}) == 3, "StrideA must be rank-3: [M, K, L]. If batch mode is not needed, set L stride to Int<0>."); + static_assert(cute::rank(StrideB{}) == 3, "StrideB must be rank-3: [N, K, L]. If batch mode is not needed, set L stride to Int<0>."); + static_assert(cute::rank(StrideC{}) == 3, "StrideC must be rank-3: [M, N, L]. If batch mode is not needed, set L stride to Int<0>."); + static_assert(cute::rank(StrideD{}) == 3, "StrideD must be rank-3: [M, N, L]. 
If batch mode is not needed, set L stride to Int<0>."); + + // Optionally append 1s until problem shape is rank-4 in case it is only rank-3 (MNK) + auto problem_shape_MNKL = append<4>(params.problem_shape, Int<1>{}); + + // Get the appropriate blocks for this thread block -- potential for thread block locality + auto blk_shape = TileShape{}; // (BLK_M,BLK_N,BLK_K) + TiledMma tiled_mma; + + // In a warp specialized kernel, collectives expose data movement and compute operations separately + CollectiveMainloop collective_mainloop; + CollectiveEpilogue collective_epilogue(params.epilogue, shared_storage.tensors.epilogue); + + // Prepare and partition the input tensors. Expects a tuple of tensors where: + // get<0>(load_inputs) is the tma tensor A after local tiling so that it has shape (BLK_M,BLK_K,m,k,l) + // get<1>(load_inputs) is the tma tensor B after local tiling so that it has shape (BLK_N,BLK_K,n,k,l) + auto load_inputs = collective_mainloop.load_init(problem_shape_MNKL, params.mainloop); + static_assert(cute::tuple_size_v >= 2, "Output of load_init must have at least two elements (A, B)"); + + // Extract out partitioned A and B. + Tensor gA_mkl = get<0>(load_inputs); + Tensor gB_nkl = get<1>(load_inputs); + + // Compute m_coord, n_coord, and l_coord with their post-tiled shapes + auto m_coord = idx2crd(int(blockIdx.x), shape<2>(gA_mkl)); + auto n_coord = idx2crd(int(blockIdx.y), shape<2>(gB_nkl)); + auto l_coord = idx2crd(int(blockIdx.z), shape<4>(gB_nkl)); + auto blk_coord = make_coord(m_coord, n_coord, _, l_coord); + + // Get pipeline iterators and increments from tensor shapes + auto k_tile_iter = cute::make_coord_iterator(shape<3>(gA_mkl)); + auto k_tile_count = size<3>(gA_mkl); + + // Wait for all thread blocks in the Cluster + cluster_wait_fn(); + + if (warp_group_role == WarpGroupRole::Producer) { + if (producer_warp_role == ProducerWarpRole::Warp0) { + if constexpr(SplitWarps) { + collective_mainloop.load_NK( + params.mainloop, + mainloop_pipeline, + prefetcher_pipeline, + mainloop_pipe_producer_state, + gB_nkl, + blk_coord, + k_tile_iter, k_tile_count, + lane_idx, + block_rank_in_cluster, + shared_storage.tensors.mainloop + ); + } + else { + collective_mainloop.load( + params.mainloop, + mainloop_pipeline, + prefetcher_pipeline, + mainloop_pipe_producer_state, + gA_mkl, gB_nkl, + blk_coord, + k_tile_iter, k_tile_count, + lane_idx, + block_rank_in_cluster, + shared_storage.tensors.mainloop + ); + } + // Update starting mainloop pipeline state for the pipeline drain + mainloop_pipe_producer_state.advance(k_tile_count); + // Make sure mainloop consumer has been waited upon before issuing epilogue load + collective_mainloop.load_tail(mainloop_pipeline, mainloop_pipe_producer_state); + + if (collective_epilogue.is_producer_load_needed()) { + // Ensure warp is converged before issuing epilogue loads + __syncwarp(); + epi_load_pipe_producer_state = collective_epilogue.load( + epi_load_pipeline, + epi_load_pipe_producer_state, + problem_shape_MNKL, + blk_shape, + blk_coord, + tiled_mma, + lane_idx, + shared_storage.tensors.epilogue + ); + collective_epilogue.load_tail(epi_load_pipeline, epi_load_pipe_producer_state); + } + } + else if (SplitWarps && producer_warp_role == ProducerWarpRole::Warp2) { + collective_mainloop.load_MK( + params.mainloop, + mainloop_pipeline, + prefetcher_pipeline, + mainloop_pipe_producer_state, + gA_mkl, + blk_coord, + k_tile_iter, k_tile_count, + lane_idx, + block_rank_in_cluster, + shared_storage.tensors.mainloop + ); + // Update starting mainloop 
pipeline state for the pipeline drain + mainloop_pipe_producer_state.advance(k_tile_count); + // Make sure mainloop consumer has been waited upon before issuing epilogue load + collective_mainloop.load_tail(mainloop_pipeline, mainloop_pipe_producer_state); + } else if (producer_warp_role == ProducerWarpRole::PrefetchMK && should_prefetch) { + collective_mainloop.prefetch_MK( + params.mainloop, + prefetcher_pipeline, + mainloop_pipe_producer_state, + gA_mkl, + blk_coord, + k_tile_iter, k_tile_count, + lane_idx, + block_rank_in_cluster, + shared_storage.tensors.mainloop + ); + } + } + else if (warp_group_role == WarpGroupRole::Consumer) { + Tensor accumulators = partition_fragment_C(tiled_mma, take<0,2>(blk_shape)); // (MMA,MMA_M,MMA_N) + + collective_mainloop.mma( + mainloop_pipeline, + mainloop_pipe_consumer_state, + accumulators, + k_tile_count, + warp_group_thread_idx, + shared_storage.tensors.mainloop, + params.mainloop + ); + + // Make sure the math instructions are done and free buffers before entering the epilogue + collective_mainloop.mma_tail( + mainloop_pipeline, + mainloop_pipe_consumer_state, + k_tile_count + ); + + // Epilogue and write to gD + auto [epi_load_pipe_consumer_state_next, epi_store_pipe_producer_state_next] = + collective_epilogue.store( + epi_load_pipeline, + epi_load_pipe_consumer_state, + epi_store_pipeline, + epi_store_pipe_producer_state, + problem_shape_MNKL, + blk_shape, + blk_coord, + accumulators, + tiled_mma, + warp_group_thread_idx, + shared_storage.tensors.epilogue + ); + + collective_epilogue.store_tail( + epi_load_pipeline, + epi_load_pipe_consumer_state_next, + epi_store_pipeline, + epi_store_pipe_producer_state_next + ); + } +#endif + } +}; + +/////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::gemm::kernel diff --git a/examples/63_hopper_gemm_with_weight_prefetch/pipeline/prefetch_pipeline_sm90.hpp b/examples/63_hopper_gemm_with_weight_prefetch/pipeline/prefetch_pipeline_sm90.hpp new file mode 100644 index 0000000000..7abd39ccfc --- /dev/null +++ b/examples/63_hopper_gemm_with_weight_prefetch/pipeline/prefetch_pipeline_sm90.hpp @@ -0,0 +1,161 @@ +/*************************************************************************************************** + * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +#pragma once + +#include "cutlass/cutlass.h" +#include "cute/arch/cluster_sm90.hpp" +#include "cutlass/arch/barrier.h" +#include "cute/container/array.hpp" + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +namespace cutlass { + +namespace detail { + +// MSVC work-around +template +struct PrefetcherPipelineSharedStorage { + using TransactionBarrier = cutlass::arch::ClusterTransactionBarrier; + using Barrier = cutlass::arch::ClusterBarrier; + + TransactionBarrier tma_barrier[Stages]; + Barrier producer_ready_barrier; +}; + +} // end namespace detail + +using namespace cute; + +// Prefetcher pipeline is modeled after PipelineTmaAsync, with a cluster transaction +// barrier providing control over the number of concurrent outstanding TMA loads. +// There is also an additional cluster barrier which is only used when `prefetch_ratio` is unset. +// `prefetch_ratio` determines how many K tiles get loaded, and when unset, the prefetcher checks +// whether DMA warps are done waiting on griddepcontrol, and if so, stops issuing more TMA loads. +template +class PrefetchPipeline { +public : + static constexpr uint32_t Stages = Stages_; + using SharedStorage = detail::PrefetcherPipelineSharedStorage; + + using TransactionBarrier = typename SharedStorage::TransactionBarrier; + using Barrier = typename SharedStorage::Barrier; + using PrefetcherBarrierType = typename TransactionBarrier::ValueType; + + struct Params { + uint32_t transaction_bytes = 0; + uint32_t num_prefetchers = 1; + bool should_prefetch = false; + }; + + // Constructor + CUTLASS_DEVICE + PrefetchPipeline(SharedStorage& storage, Params params) + : params_(params) + , tma_barrier_ptr_(&storage.tma_barrier[0]) + , producer_ready_barrier_ptr_(&storage.producer_ready_barrier) { + + int lane_predicate = cute::elect_one_sync(); + if (params.should_prefetch && lane_predicate) { + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < Stages; ++i) { + tma_barrier_ptr_[i].init(params.num_prefetchers); + } + producer_ready_barrier_ptr_[0].init(1); + } + } + + CUTLASS_DEVICE + void producer_arrive() { + if (params_.should_prefetch) { + producer_ready_barrier_ptr_[0].arrive(); + } + } + + CUTLASS_DEVICE + bool have_producers_arrived() { + if (params_.should_prefetch) { + uint32_t barrier_status_ = producer_ready_barrier_ptr_[0].try_wait(0); + auto barrier_status = static_cast(barrier_status_); + if (barrier_status == BarrierStatus::WaitDone) { + return true; // exit prefetcher loop + } + return false; + } + return true; + } + + CUTLASS_DEVICE + void prefetcher_acquire(uint32_t stage, uint32_t phase, bool should_wait) { + if (params_.should_prefetch) { + if (should_wait) { + tma_barrier_ptr_[stage].wait(phase ^ 1); + } + tma_barrier_ptr_[stage].arrive_and_expect_tx(params_.transaction_bytes); + } + } + + CUTLASS_DEVICE + void advance_prefetcher_state(uint32_t& stage, 
uint32_t& phase) { + if (params_.should_prefetch) { + stage++; + if (stage == Stages) { + stage = 0; + phase ^= 1; + } + } + } + + CUTLASS_DEVICE + void prefetcher_tail(uint32_t stage, uint32_t phase) { + if (params_.should_prefetch) { + // Wait on any already-issued loads + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < stage; ++i) { + tma_barrier_ptr_[i].wait(phase); + } + } + } + + CUTLASS_DEVICE + PrefetcherBarrierType* prefetcher_get_barrier(uint32_t stage) { + return reinterpret_cast(&tma_barrier_ptr_[stage]); + } + +private : + TransactionBarrier* tma_barrier_ptr_ = nullptr; + Barrier* producer_ready_barrier_ptr_ = nullptr; + Params params_; + +}; + +} // end namespace cutlass diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt index e9a7a02c9b..b40a2ff59d 100644 --- a/examples/CMakeLists.txt +++ b/examples/CMakeLists.txt @@ -152,6 +152,9 @@ if (NOT CUTLASS_ENABLE_SYCL) 57_hopper_grouped_gemm 58_ada_fp8_gemm 59_ampere_gather_scatter_conv + 61_hopper_gemm_with_topk_and_softmax + 62_hopper_sparse_gemm + 63_hopper_gemm_with_weight_prefetch ) add_subdirectory(${EXAMPLE}) endforeach() diff --git a/examples/cute/tutorial/tiled_copy.cu b/examples/cute/tutorial/tiled_copy.cu index d370320b1b..87ad873ce6 100644 --- a/examples/cute/tutorial/tiled_copy.cu +++ b/examples/cute/tutorial/tiled_copy.cu @@ -186,8 +186,8 @@ int main(int argc, char** argv) return -1; } // Equivalent check to the above - if (not weakly_compatible(block_shape, tensor_shape)) { - std::cerr << "Expected the tensors to be weakly compatible with the block_shape." << std::endl; + if (not evenly_divides(tensor_shape, block_shape)) { + std::cerr << "Expected the block_shape to evenly divide the tensor shape." << std::endl; return -1; } diff --git a/examples/cute/tutorial/tiled_copy_sycl.cpp b/examples/cute/tutorial/tiled_copy_sycl.cpp index 3093e0351b..77c5484832 100644 --- a/examples/cute/tutorial/tiled_copy_sycl.cpp +++ b/examples/cute/tutorial/tiled_copy_sycl.cpp @@ -192,8 +192,8 @@ int main(int argc, char** argv) return -1; } // Equivalent check to the above - if (not weakly_compatible(block_shape, tensor_shape)) { - std::cerr << "Expected the tensors to be weakly compatible with the block_shape." << std::endl; + if (not evenly_divides(tensor_shape, block_shape)) { + std::cerr << "Expected the block_shape to evenly divide the tensor shape." 
<< std::endl; return -1; } diff --git a/include/cute/algorithm/clear.hpp b/include/cute/algorithm/clear.hpp index 1c7dd5a334..0b3a8eaa1d 100644 --- a/include/cute/algorithm/clear.hpp +++ b/include/cute/algorithm/clear.hpp @@ -30,9 +30,9 @@ **************************************************************************************************/ #pragma once -#include -#include -#include +#include // CUTE_HOST_DEVICE +#include // cute::Tensor +#include // cute::fill namespace cute { diff --git a/include/cute/algorithm/cooperative_copy.hpp b/include/cute/algorithm/cooperative_copy.hpp index b2be11717f..9d080116da 100644 --- a/include/cute/algorithm/cooperative_copy.hpp +++ b/include/cute/algorithm/cooperative_copy.hpp @@ -31,12 +31,14 @@ #pragma once #include - -#include -#include - -#include +#include +#include // cute::logical_divide +#include // cute::Swizzle +#include // cute::get_nonswizzle_portion +#include // cute::Tensor #include +#include +#include namespace cute { diff --git a/include/cute/algorithm/cooperative_gemm.hpp b/include/cute/algorithm/cooperative_gemm.hpp index da03bfbd11..2c91ce6f45 100644 --- a/include/cute/algorithm/cooperative_gemm.hpp +++ b/include/cute/algorithm/cooperative_gemm.hpp @@ -434,8 +434,8 @@ cooperative_gemm(uint32_t thread_idx, static_assert(is_convertible_v>, TypeC>, "CStoreTransformOp functor must accept value of type TC::value_type and return value convertible to type TC::value_type"); - static constexpr bool compat = weakly_compatible(tile_shape(TiledMMA{}), - make_shape(size<0>(sA), size<0>(sB), size<1>(sA))); + static constexpr bool compat = evenly_divides(make_shape(size<0>(sA), size<0>(sB), size<1>(sA)), + tile_shape(TiledMMA{})); if constexpr (compat) { detail::cooperative_gemm_no_predication( thread_idx, tiled_mma, alpha, sA, sB, beta, sC, diff --git a/include/cute/algorithm/copy.hpp b/include/cute/algorithm/copy.hpp index 2a37995eea..c2decd15d7 100644 --- a/include/cute/algorithm/copy.hpp +++ b/include/cute/algorithm/copy.hpp @@ -30,14 +30,10 @@ **************************************************************************************************/ #pragma once -#include - -#include - -#include -#include - -#include +#include // CUTE_HOST_DEVICE +#include // cute::Tensor +#include // cute::TrivialPredTensor +#include // cute::Copy_Atom namespace cute { diff --git a/include/cute/algorithm/functional.hpp b/include/cute/algorithm/functional.hpp index 8e7a58a5bc..ef80d018d7 100644 --- a/include/cute/algorithm/functional.hpp +++ b/include/cute/algorithm/functional.hpp @@ -30,10 +30,9 @@ **************************************************************************************************/ #pragma once -#include - -#include -#include +#include // CUTE_HOST_DEVICE +#include // cute::max, cute::min +#include // cute::conj /** C++14 extensions */ diff --git a/include/cute/algorithm/prefetch.hpp b/include/cute/algorithm/prefetch.hpp index 0d638ab58f..c39f63acdd 100644 --- a/include/cute/algorithm/prefetch.hpp +++ b/include/cute/algorithm/prefetch.hpp @@ -30,11 +30,9 @@ **************************************************************************************************/ #pragma once -#include - -#include - -#include +#include // CUTE_HOST_DEVICE +#include // cute::Tensor +#include // cute::Copy_Atom namespace cute { diff --git a/include/cute/algorithm/tuple_algorithms.hpp b/include/cute/algorithm/tuple_algorithms.hpp index 616960a54a..5a70f590b6 100644 --- a/include/cute/algorithm/tuple_algorithms.hpp +++ b/include/cute/algorithm/tuple_algorithms.hpp @@ -44,7 +44,7 @@ /// 
Code guidelines and style preferences: /// /// For perfect forwarding, don't use std::forward, because it may not -/// be defined in device code when compiling with NVRTC. Instead, use +/// be defined in device code when compiling with NVRTC. Instead, use /// `static_cast(parameter_name)`. /// /// CuTe generally does not bother forwarding functions, as @@ -52,24 +52,9 @@ /// /// Throughout CUTLASS, cute::make_tuple always needs to be called /// namespace-qualified, EVEN If inside the cute namespace and/or in -/// scope of a "using namespace cute" declaration. Otherwise, the +/// scope of a "using namespace cute" declaration. Otherwise, the /// compiler may select std::make_tuple instead of cute::make_tuple, -/// due to argument-dependent lookup. Two problems may result from -/// that. -/// -/// 1. Functions have an unexpected return type (std::tuple instead of -/// cute::tuple), so functions that take cute::tuple parameters -/// fail to compile (generally inside functions that have template -/// parameters expected to be cute::tuple). -/// -/// 2. std::tuple does not have the required __host__ __device__ -/// markings, so the CUDA compiler complains if you use it in -/// device code. -/// -/// cute::make_tuple will occur more often than std::make_tuple would -/// in modern C++ code, because cute::tuple's design deprioritizes -/// correct operation of CTAD (constructor template argument -/// deduction) in favor of implementation simplicity. +/// due to argument-dependent lookup. namespace cute { @@ -145,6 +130,8 @@ transform_apply(T&& t, F&& f, G&& g) } else { return g(f(static_cast(t))); } + + CUTE_GCC_UNREACHABLE; } template @@ -157,6 +144,8 @@ transform_apply(T0&& t0, T1&& t1, F&& f, G&& g) } else { return g(f(static_cast(t0), static_cast(t1))); } + + CUTE_GCC_UNREACHABLE; } template @@ -169,6 +158,8 @@ transform_apply(T0&& t0, T1&& t1, T2&& t2, F&& f, G&& g) } else { return g(f(static_cast(t0), static_cast(t1), static_cast(t2))); } + + CUTE_GCC_UNREACHABLE; } // @@ -349,7 +340,7 @@ auto all_of(T const& t, F&& f) { if constexpr (is_tuple::value) { - return detail::apply(t, [&] (auto const&... a) { return (true_type{} && ... && f(a)); }, tuple_seq{}); + return detail::apply(cute::transform(t, f), [&] (auto const&... a) { return (true_type{} && ... && a); }, tuple_seq{}); } else { return f(t); } @@ -401,71 +392,36 @@ filter_tuple(T0 const& t0, T1 const& t1, T2 const& t2, F&& f) namespace detail { -// This impl compiles much faster than cute::apply and variadic args -template -CUTE_HOST_DEVICE constexpr -auto -fold(T&&, V&& v, F&&, seq<>) -{ - return v; -} - -template -CUTE_HOST_DEVICE constexpr -auto -fold(T&& t, V&& v, F&& f, seq) -{ - return f(static_cast(v), get(static_cast(t))); -} - -template -CUTE_HOST_DEVICE constexpr -auto -fold(T&& t, V&& v, F&& f, seq) -{ - return f(f(static_cast(v), get(static_cast(t))), get(static_cast(t))); -} - -template -CUTE_HOST_DEVICE constexpr -auto -fold(T&& t, V&& v, F&& f, seq) -{ - return f(f(f(static_cast(v), get(static_cast(t))), get(static_cast(t))), get(static_cast(t))); -} +template +struct FoldAdaptor { + template + CUTE_HOST_DEVICE constexpr auto operator|(X&& x) { + auto r = fn_(val_, static_cast(x)); + return FoldAdaptor{fn_, r}; + } + Fn fn_; + Val val_; +}; -template +template CUTE_HOST_DEVICE constexpr auto -fold(T&& t, V&& v, F&& f, seq) +fold(T&& t, V const& v, F&& f, seq) { - return f(f(f(f(static_cast(v), get(static_cast(t))), get(static_cast(t))), get(static_cast(t))), get(static_cast(t))); + return (FoldAdaptor{f,v} | ... 
| get(static_cast(t))).val_; } -template -CUTE_HOST_DEVICE constexpr -auto -fold(T&& t, V&& v, F&& f, seq) -{ - return fold(static_cast(t), - f(f(f(f(static_cast(v), get(static_cast(t))), get(static_cast(t))), get(static_cast(t))), get(static_cast(t))), - f, - seq{}); -} } // end namespace detail template CUTE_HOST_DEVICE constexpr auto -fold(T&& t, V&& v, F&& f) +fold(T&& t, V const& v, F&& f) { if constexpr (is_tuple>::value) { - return detail::fold(static_cast(t), - static_cast(v), - f, - tuple_seq{}); + return detail::fold(static_cast(t), v, f, tuple_seq{}); } else { - return f(static_cast(v), static_cast(t)); + return f(v, static_cast(t)); } CUTE_GCC_UNREACHABLE; @@ -477,10 +433,7 @@ auto fold_first(T&& t, F&& f) { if constexpr (is_tuple>::value) { - return detail::fold(static_cast(t), - get<0>(static_cast(t)), - f, - make_range<1,tuple_size>::value>{}); + return detail::fold(static_cast(t), get<0>(t), f, make_range<1,tuple_size>::value>{}); } else { return t; } @@ -536,13 +489,23 @@ CUTE_HOST_DEVICE constexpr auto take(T const& t) { - return detail::apply(t, [](auto const&... a) { return cute::make_tuple(a...); }, make_range{}); + if constexpr (E == -1) { + if constexpr (is_tuple::value) { + return take::value>(t); + } else { + return take(t); + } + } else + if constexpr (B <= E) { + return detail::apply(t, [](auto const&... a) { return cute::make_tuple(a...); }, make_range{}); + } else { + static_assert(B <= E); + } + + CUTE_GCC_UNREACHABLE; } -// // Select tuple elements with given indices. -// - template CUTE_HOST_DEVICE constexpr auto @@ -551,19 +514,6 @@ select(T const& t) return cute::make_tuple(get(t)...); } -template -CUTE_HOST_DEVICE constexpr -auto -select(T const& t, Indices const& indices) -{ - if constexpr (is_tuple::value) { - return cute::transform(indices, [&t](auto i) { return select(t, i); }); - } else { - static_assert(is_static::value, "Order must be static"); - return get(t); - } -} - // Wrap non-tuples into rank-1 tuples or forward template CUTE_HOST_DEVICE constexpr diff --git a/include/cute/arch/cluster_sm90.hpp b/include/cute/arch/cluster_sm90.hpp index 27a34d7773..8fff51be8e 100644 --- a/include/cute/arch/cluster_sm90.hpp +++ b/include/cute/arch/cluster_sm90.hpp @@ -150,7 +150,7 @@ CUTE_DEVICE dim3 cluster_shape() } // Get 1D ctaid in a cluster. -CUTLASS_DEVICE uint32_t block_rank_in_cluster() +CUTE_DEVICE uint32_t block_rank_in_cluster() { #if defined(CUTE_ARCH_CLUSTER_SM90_ENABLED) uint32_t rank; @@ -162,7 +162,7 @@ CUTLASS_DEVICE uint32_t block_rank_in_cluster() } // Set the destination block-ID in cluster for a given SMEM Address -CUTLASS_DEVICE uint32_t set_block_rank(uint32_t smemAddr, uint32_t rank) +CUTE_DEVICE uint32_t set_block_rank(uint32_t smemAddr, uint32_t rank) { #if defined(CUTE_ARCH_CLUSTER_SM90_ENABLED) uint32_t result; diff --git a/include/cute/arch/config.hpp b/include/cute/arch/config.hpp new file mode 100644 index 0000000000..84d7779a34 --- /dev/null +++ b/include/cute/arch/config.hpp @@ -0,0 +1,50 @@ +/*************************************************************************************************** + * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. 
Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +#pragma once + +#include // CUTLASS_ARCH_MMA_SMxx_ENABLED + +// TMA instructions +#if defined(CUTLASS_ARCH_MMA_SM90_ENABLED) +# define CUTE_ARCH_TMA_SM90_ENABLED +#endif + +#if defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_ENABLED) +# define CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED +#endif + +// STSM +#if defined(CUTLASS_ARCH_MMA_SM90_ENABLED) +# define CUTE_ARCH_STSM_SM90_ENABLED +#endif + +//////////////////////////////////////////////////////////////////////////////////////////////////// + diff --git a/include/cute/arch/copy_sm50.hpp b/include/cute/arch/copy_sm50.hpp index 9cf0efcdf5..925d9ebe37 100644 --- a/include/cute/arch/copy_sm50.hpp +++ b/include/cute/arch/copy_sm50.hpp @@ -40,8 +40,8 @@ namespace cute { - -struct SM50_Shuffle_U32_2x2Trans +// Shuffle data between thread pair (0, 1), (2, 3), etc. +struct SM50_Shuffle_U32_2x2Trans_XOR1 { using SRegisters = uint32_t[2]; using DRegisters = uint32_t[2]; @@ -68,5 +68,31 @@ struct SM50_Shuffle_U32_2x2Trans } }; +// Shuffle data between thread pair (0, 4), (1, 5), etc. +struct SM50_Shuffle_U32_2x2Trans_XOR4 +{ + using SRegisters = uint32_t[2]; + using DRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + copy(uint32_t const& src0, uint32_t const& src1, uint32_t& dst0, uint32_t& dst1) + { +#if defined(CUTE_ARCH_WARP_SHUFFLE_ENABLED) + uint32_t x0 = threadIdx.x & 4 ? src0 : src1; + uint32_t y0 = __shfl_xor_sync(0xffffffff, x0, 4); + + // Replace destination register with shuffle result.
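// [Editorial worked example, not part of the upstream patch.] Let paired
// threads t0 and t4 each hold one column of a 2x2 tile: t0 = {a0, a1},
// t4 = {b0, b1}. t0 (bit 4 clear) sends x0 = src1 = a1, while t4 (bit 4
// set) sends x0 = src0 = b0; after the XOR-4 shuffle, t0 receives b0 into
// dst1 and t4 receives a1 into dst0:
//
//   before:  t0 = {a0, a1}   t4 = {b0, b1}
//   after:   t0 = {a0, b0}   t4 = {a1, b1}
//
// i.e. the 2x2 tile is transposed across the thread pair. The diagonal
// register of each thread is never written here, so callers are assumed to
// pre-initialize dst0/dst1 from src0/src1.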
+ if (threadIdx.x & 0x4) { + dst0 = y0; + } + else { + dst1 = y0; + } +#else + CUTE_INVALID_CONTROL_PATH("Trying to use __shfl_xor_sync without CUTE_ARCH_WARP_SHUFFLE_ENABLED."); +#endif + } +}; + } // end namespace cute diff --git a/include/cute/arch/copy_sm80.hpp b/include/cute/arch/copy_sm80.hpp index 145e6b33c5..ab4fb68244 100644 --- a/include/cute/arch/copy_sm80.hpp +++ b/include/cute/arch/copy_sm80.hpp @@ -78,7 +78,7 @@ struct SM80_CP_ASYNC_CACHEGLOBAL using DRegisters = TD[1]; static_assert(sizeof(TS) == sizeof(TD), "cp.async requires sizeof(src_value_type) == sizeof(dst_value_type)"); - static_assert(sizeof(TS) == 4 || sizeof(TS) == 8 || sizeof(TS) == 16, "cp.async sizeof(TS) is not supported"); + static_assert(sizeof(TS) == 16, "cp.async sizeof(TS) is not supported"); CUTE_HOST_DEVICE static void copy(TS const& gmem_src, @@ -135,7 +135,7 @@ struct SM80_CP_ASYNC_CACHEGLOBAL_ZFILL using DRegisters = TD[1]; static_assert(sizeof(TS) == sizeof(TD), "cp.async requires sizeof(src_value_type) == sizeof(dst_value_type)"); - static_assert(sizeof(TS) == 4 || sizeof(TS) == 8 || sizeof(TS) == 16, "cp.async sizeof(TS) is not supported"); + static_assert(sizeof(TS) == 16, "cp.async sizeof(TS) is not supported"); CUTE_HOST_DEVICE static void copy(TS const& gmem_src, diff --git a/include/cute/arch/copy_sm90.hpp b/include/cute/arch/copy_sm90.hpp index e5684ec469..bcb3b7d19c 100644 --- a/include/cute/arch/copy_sm90.hpp +++ b/include/cute/arch/copy_sm90.hpp @@ -30,21 +30,10 @@ **************************************************************************************************/ #pragma once -#include - +#include // CUTE_HOST_DEVICE +#include // CUTE_ARCH_TMA_SMxx_ENABLED #include -// Config -#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && (__CUDACC_VER_MAJOR__ >= 12)) -# define CUTE_ARCH_STSM_SM90_ENABLED -# define CUTE_ARCH_TMA_SM90_ENABLED -#endif - -#if defined(CUTE_ARCH_TMA_SM90_ENABLED) && \ - ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3))) -# define CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED -#endif - namespace cute { diff --git a/include/cute/arch/copy_sm90_desc.hpp b/include/cute/arch/copy_sm90_desc.hpp index beb5964903..ee01151872 100644 --- a/include/cute/arch/copy_sm90_desc.hpp +++ b/include/cute/arch/copy_sm90_desc.hpp @@ -30,6 +30,8 @@ **************************************************************************************************/ #pragma once +#include "cutlass/numeric_types.h" + #if !defined(__CUDACC_RTC__) && !defined(CUTLASS_ENABLE_SYCL) #include #include @@ -37,6 +39,8 @@ #include +#include // cute::cast_smem_ptr_to_uint +#include // CUTE_ARCH_TMA_SMxx_ENABLED #include #include @@ -134,6 +138,10 @@ enum class SmemSwizzleBits : uint8_t { B128 = 3, }; +enum class SmemSwizzleBase : uint8_t { + SWIZZLE_BASE_16B = 0, +}; + enum class OOBFill : uint8_t { ZERO = 0, CONSTANT = 1, @@ -201,13 +209,21 @@ to_CUtensorMapDataType() { } inline CUtensorMapSwizzle -to_CUtensorMapSwizzle(SmemSwizzleBits const& t) { +to_CUtensorMapSwizzle(SmemSwizzleBits const& t, SmemSwizzleBase const& b) { switch (t) { - default: assert(false && "Unknown SmemSwizzleBits!"); - case SmemSwizzleBits::DISABLE: return CU_TENSOR_MAP_SWIZZLE_NONE; - case SmemSwizzleBits::B32: return CU_TENSOR_MAP_SWIZZLE_32B; - case SmemSwizzleBits::B64: return CU_TENSOR_MAP_SWIZZLE_64B; - case SmemSwizzleBits::B128: return CU_TENSOR_MAP_SWIZZLE_128B; + default: assert(false && "Unsupported pair of SmemSwizzleBits and SmemSwizzleBase!"); + case SmemSwizzleBits::DISABLE: + assert((b 
== SmemSwizzleBase::SWIZZLE_BASE_16B) && "Expected 16B swizzle base for 0B swizzle bits."); + return CU_TENSOR_MAP_SWIZZLE_NONE; + case SmemSwizzleBits::B32: + assert((b == SmemSwizzleBase::SWIZZLE_BASE_16B) && "Expected 16B swizzle base for 32B swizzle bits."); + return CU_TENSOR_MAP_SWIZZLE_32B; + case SmemSwizzleBits::B64: + assert((b == SmemSwizzleBase::SWIZZLE_BASE_16B) && "Expected 16B swizzle base for 64B swizzle bits."); + return CU_TENSOR_MAP_SWIZZLE_64B; + case SmemSwizzleBits::B128: + assert((b == SmemSwizzleBase::SWIZZLE_BASE_16B) && "Expected 16B swizzle base for 128B swizzle bits."); + return CU_TENSOR_MAP_SWIZZLE_128B; } } @@ -282,7 +298,7 @@ tma_descriptor_replace_addr_in_global_mem(TmaDescriptor const* desc_ptr, "tensormap.replace.tile.global_address.global.b1024.b64 [%0], %1;" :: "l"(gmem_int_desc), "l"(new_desc_addr)); #else - CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_TMA_SM90_ENABLED and CUDA 12.3"); + CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED and CUDA 12.3"); #endif } @@ -295,15 +311,11 @@ tma_descriptor_replace_addr_in_shared_mem(TmaDescriptor& smem_desc, #if defined(CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED) uint32_t smem_int_desc = cast_smem_ptr_to_uint(&smem_desc); uint64_t const new_desc_addr = reinterpret_cast(new_tensor_ptr); - uint64_t const smem_int64_desc = 0; - asm volatile ( - "cvt.u64.u32 %0, %1;" - :: "l"(smem_int64_desc), "r"(smem_int_desc)); asm volatile ( "tensormap.replace.tile.global_address.shared::cta.b1024.b64 [%0], %1;" - :: "l"(smem_int64_desc), "l"(new_desc_addr)); + :: "r"(smem_int_desc), "l"(new_desc_addr)); #else - CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_TMA_SM90_ENABLED and CUDA 12.3"); + CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED and CUDA 12.3"); #endif } @@ -311,8 +323,8 @@ tma_descriptor_replace_addr_in_shared_mem(TmaDescriptor& smem_desc, CUTE_HOST_DEVICE void tma_descriptor_replace_dims_strides_in_shared_mem(TmaDescriptor & smem_desc, - cute::array const& prob_shape, - cute::array const& prob_stride) + cute::array const& prob_shape, + cute::array const& prob_stride) { #if defined(CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED) uint32_t smem_int_desc = cast_smem_ptr_to_uint(&smem_desc); @@ -329,25 +341,43 @@ tma_descriptor_replace_dims_strides_in_shared_mem(TmaDescriptor asm volatile ( "tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [%0], 2, %1;" :: "l"(smem_int64_desc), "r"(prob_shape[2])); + asm volatile ( + "tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [%0], 3, %1;" + :: "l"(smem_int64_desc), "r"(prob_shape[3])); + asm volatile ( + "tensormap.replace.tile.global_dim.shared::cta.b1024.b32 [%0], 4, %1;" + :: "l"(smem_int64_desc), "r"(prob_shape[4])); // Strides must be a multiple of 16. 
Also, stride for the innermost dimension is implicitly 1 #if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 5))) - // 4 LSBs are not included asm volatile ( "tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [%0], 0, %1;" :: "l"(smem_int64_desc), "l"(prob_stride[1])); asm volatile ( "tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [%0], 1, %1;" :: "l"(smem_int64_desc), "l"(prob_stride[2])); + asm volatile ( + "tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [%0], 2, %1;" + :: "l"(smem_int64_desc), "l"(prob_stride[3])); + asm volatile ( + "tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [%0], 3, %1;" + :: "l"(smem_int64_desc), "l"(prob_stride[4])); #else + // 4 LSBs are not included asm volatile ( "tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [%0], 0, %1;" :: "l"(smem_int64_desc), "l"(prob_stride[1] >> 4)); asm volatile ( "tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [%0], 1, %1;" :: "l"(smem_int64_desc), "l"(prob_stride[2] >> 4)); + asm volatile ( + "tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [%0], 2, %1;" + :: "l"(smem_int64_desc), "l"(prob_stride[3] >> 4)); + asm volatile ( + "tensormap.replace.tile.global_stride.shared::cta.b1024.b64 [%0], 3, %1;" + :: "l"(smem_int64_desc), "l"(prob_stride[4] >> 4)); #endif #else - CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_TMA_SM90_ENABLED and CUDA 12.3"); + CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED and CUDA 12.3"); #endif } @@ -366,7 +396,7 @@ tma_descriptor_cp_fence_release(TmaDescriptor const* gmem_desc_ptr, TmaDescripto "tensormap.cp_fenceproxy.global.shared::cta.tensormap::generic.release.gpu.sync.aligned [%0], [%1], 128;" :: "l"(gmem_int_desc), "r"(smem_int_desc)); #else - CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_TMA_SM90_ENABLED and CUDA 12.3"); + CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED and CUDA 12.3"); #endif } @@ -381,7 +411,7 @@ tma_descriptor_fence_release() #if defined(CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED) asm volatile ("fence.proxy.tensormap::generic.release.gpu;"); #else - CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_TMA_SM90_ENABLED and CUDA 12.3"); + CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED and CUDA 12.3"); #endif } @@ -400,13 +430,8 @@ tma_descriptor_fence_acquire(TmaDescriptor const* desc_ptr) : : "l"(gmem_int_desc) : "memory"); - asm volatile ( - "cvta.global.u64 %0, %0;" - : - : "l"(gmem_int_desc), "l"(gmem_int_desc) - : "memory"); #else - CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_TMA_SM90_ENABLED and CUDA 12.3"); + CUTE_INVALID_CONTROL_PATH("Using TMA Descriptor modification without CUTE_ARCH_DEVICE_MODIFIABLE_TMA_SM90_ENABLED and CUDA 12.3"); #endif } diff --git a/include/cute/arch/copy_sm90_tma.hpp b/include/cute/arch/copy_sm90_tma.hpp index 1851482119..fb33d63cad 100644 --- a/include/cute/arch/copy_sm90_tma.hpp +++ b/include/cute/arch/copy_sm90_tma.hpp @@ -32,8 +32,11 @@ #include +#include // CUTE_ARCH_TMA_SMxx_ENABLED #include #include +#include "cutlass/arch/synclog.hpp" + namespace cute { @@ -52,6 +55,7 @@ struct SM90_TMA_LOAD_1D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar =
cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.1d.shared::cluster.global.mbarrier::complete_tx::bytes.L2::cache_hint" " [%0], [%1, {%3}], [%2], %4;" @@ -97,6 +101,7 @@ struct SM90_TMA_LOAD_2D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes.L2::cache_hint" " [%0], [%1, {%3, %4}], [%2], %5;" @@ -142,6 +147,7 @@ struct SM90_TMA_LOAD_3D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.3d.shared::cluster.global.mbarrier::complete_tx::bytes.L2::cache_hint" " [%0], [%1, {%3, %4, %5}], [%2], %6;" @@ -187,6 +193,7 @@ struct SM90_TMA_LOAD_4D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.4d.shared::cluster.global.mbarrier::complete_tx::bytes.L2::cache_hint" " [%0], [%1, {%3, %4, %5, %6}], [%2], %7;" @@ -232,6 +239,7 @@ struct SM90_TMA_LOAD_5D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.5d.shared::cluster.global.mbarrier::complete_tx::bytes.L2::cache_hint" " [%0], [%1, {%3, %4, %5, %6, %7}], [%2], %8;" @@ -355,6 +363,7 @@ struct SM90_TMA_LOAD_IM2COL_3D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); // Copy from global to shared::cluster. asm volatile ( "cp.async.bulk.tensor.3d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes" @@ -405,6 +414,7 @@ struct SM90_TMA_LOAD_IM2COL_4D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); // Copy from global to shared::cluster. asm volatile ( "cp.async.bulk.tensor.4d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes" @@ -455,6 +465,7 @@ struct SM90_TMA_LOAD_IM2COL_5D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); // Copy from global to shared::cluster. 
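// [Editorial note, not part of the upstream patch.] Every TMA copy wrapper
// in this file now follows the same instrumentation pattern: flatten the
// generic pointers to integer operands, emit a synclog record, then issue
// the PTX instruction. A minimal sketch of that pattern (assuming the
// synclog_emit_* calls compile to no-ops when the synclog tool is disabled):
//
//   uint64_t gmem_int_desc = reinterpret_cast<uint64_t>(desc_ptr);
//   uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr);
//   uint32_t smem_int_ptr  = cast_smem_ptr_to_uint(smem_ptr);
//   cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc,
//                                        smem_int_mbar, smem_int_ptr);
//   asm volatile("cp.async.bulk.tensor. ..." :: /* operands */ : "memory");
//
// so release builds pay nothing for the added logging.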
asm volatile ( "cp.async.bulk.tensor.5d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes" @@ -565,7 +576,7 @@ struct SM90_TMA_LOAD_IM2COL struct SM90_TMA_LOAD_MULTICAST_1D { CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0) { @@ -573,13 +584,14 @@ struct SM90_TMA_LOAD_MULTICAST_1D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( - "cp.async.bulk.tensor.1d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster" - " [%0], [%1, {%4}], [%2], %3;" + "cp.async.bulk.tensor.1d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster.L2::cache_hint" + " [%0], [%1, {%4}], [%2], %3, %5;" : : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(smem_int_mbar), "h"(multicast_mask), - "r"(crd0) + "r"(crd0), "l"(cache_hint) : "memory"); #else CUTE_INVALID_CONTROL_PATH("Trying to use tma without CUTE_ARCH_TMA_SM90_ENABLED."); @@ -590,7 +602,7 @@ struct SM90_TMA_LOAD_MULTICAST_1D struct SM90_TMA_LOAD_MULTICAST_2D { CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0, int32_t const& crd1) { @@ -598,13 +610,14 @@ struct SM90_TMA_LOAD_MULTICAST_2D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( - "cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster" - " [%0], [%1, {%4, %5}], [%2], %3;" + "cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster.L2::cache_hint" + " [%0], [%1, {%4, %5}], [%2], %3, %6;" : : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(smem_int_mbar), "h"(multicast_mask), - "r"(crd0), "r"(crd1) + "r"(crd0), "r"(crd1), "l"(cache_hint) : "memory"); #else CUTE_INVALID_CONTROL_PATH("Trying to use tma without CUTE_ARCH_TMA_SM90_ENABLED."); @@ -615,7 +628,7 @@ struct SM90_TMA_LOAD_MULTICAST_2D struct SM90_TMA_LOAD_MULTICAST_3D { CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0, int32_t const& crd1, int32_t const& crd2) { @@ -623,13 +636,14 @@ struct SM90_TMA_LOAD_MULTICAST_3D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( - "cp.async.bulk.tensor.3d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster" - " [%0], [%1, {%4, %5, %6}], [%2], %3;" + "cp.async.bulk.tensor.3d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster.L2::cache_hint" + " [%0], [%1, {%4, %5, %6}], [%2], %3, %7;" : : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(smem_int_mbar), 
"h"(multicast_mask), - "r"(crd0), "r"(crd1), "r"(crd2) + "r"(crd0), "r"(crd1), "r"(crd2), "l"(cache_hint) : "memory"); #else CUTE_INVALID_CONTROL_PATH("Trying to use tma without CUTE_ARCH_TMA_SM90_ENABLED."); @@ -640,7 +654,7 @@ struct SM90_TMA_LOAD_MULTICAST_3D struct SM90_TMA_LOAD_MULTICAST_4D { CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0, int32_t const& crd1, int32_t const& crd2, int32_t const& crd3) { @@ -648,13 +662,14 @@ struct SM90_TMA_LOAD_MULTICAST_4D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( - "cp.async.bulk.tensor.4d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster" - " [%0], [%1, {%4, %5, %6, %7}], [%2], %3;" + "cp.async.bulk.tensor.4d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster.L2::cache_hint" + " [%0], [%1, {%4, %5, %6, %7}], [%2], %3, %8;" : : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(smem_int_mbar), "h"(multicast_mask), - "r"(crd0), "r"(crd1), "r"(crd2), "r"(crd3) + "r"(crd0), "r"(crd1), "r"(crd2), "r"(crd3), "l"(cache_hint) : "memory"); #else CUTE_INVALID_CONTROL_PATH("Trying to use tma without CUTE_ARCH_TMA_SM90_ENABLED."); @@ -665,7 +680,7 @@ struct SM90_TMA_LOAD_MULTICAST_4D struct SM90_TMA_LOAD_MULTICAST_5D { CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0, int32_t const& crd1, int32_t const& crd2, int32_t const& crd3, int32_t const& crd4) { @@ -673,13 +688,14 @@ struct SM90_TMA_LOAD_MULTICAST_5D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); asm volatile ( - "cp.async.bulk.tensor.5d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster" - " [%0], [%1, {%4, %5, %6, %7, %8}], [%2], %3;" + "cp.async.bulk.tensor.5d.shared::cluster.global.mbarrier::complete_tx::bytes.multicast::cluster.L2::cache_hint" + " [%0], [%1, {%4, %5, %6, %7, %8}], [%2], %3, %9;" : : "r"(smem_int_ptr), "l"(gmem_int_desc), "r"(smem_int_mbar), "h"(multicast_mask), - "r"(crd0), "r"(crd1), "r"(crd2), "r"(crd3), "r"(crd4) + "r"(crd0), "r"(crd1), "r"(crd2), "r"(crd3), "r"(crd4), "l"(cache_hint) : "memory"); #else CUTE_INVALID_CONTROL_PATH("Trying to use tma without CUTE_ARCH_TMA_SM90_ENABLED."); @@ -690,39 +706,39 @@ struct SM90_TMA_LOAD_MULTICAST_5D struct SM90_TMA_LOAD_MULTICAST { CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0) { - return SM90_TMA_LOAD_MULTICAST_1D::copy(desc_ptr, mbar_ptr, multicast_mask, smem_ptr, crd0); + return SM90_TMA_LOAD_MULTICAST_1D::copy(desc_ptr, mbar_ptr, multicast_mask, cache_hint, smem_ptr, crd0); } CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + 
copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0, int32_t const& crd1) { - return SM90_TMA_LOAD_MULTICAST_2D::copy(desc_ptr, mbar_ptr, multicast_mask, smem_ptr, crd0, crd1); + return SM90_TMA_LOAD_MULTICAST_2D::copy(desc_ptr, mbar_ptr, multicast_mask, cache_hint, smem_ptr, crd0, crd1); } CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0, int32_t const& crd1, int32_t const& crd2) { - return SM90_TMA_LOAD_MULTICAST_3D::copy(desc_ptr, mbar_ptr, multicast_mask, smem_ptr, crd0, crd1, crd2); + return SM90_TMA_LOAD_MULTICAST_3D::copy(desc_ptr, mbar_ptr, multicast_mask, cache_hint, smem_ptr, crd0, crd1, crd2); } CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0, int32_t const& crd1, int32_t const& crd2, int32_t const& crd3) { - return SM90_TMA_LOAD_MULTICAST_4D::copy(desc_ptr, mbar_ptr, multicast_mask, smem_ptr, crd0, crd1, crd2, crd3); + return SM90_TMA_LOAD_MULTICAST_4D::copy(desc_ptr, mbar_ptr, multicast_mask, cache_hint, smem_ptr, crd0, crd1, crd2, crd3); } CUTE_HOST_DEVICE static void - copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, + copy(void const* desc_ptr, uint64_t* mbar_ptr, uint16_t multicast_mask, uint64_t cache_hint, void * smem_ptr, int32_t const& crd0, int32_t const& crd1, int32_t const& crd2, int32_t const& crd3, int32_t const& crd4) { - return SM90_TMA_LOAD_MULTICAST_5D::copy(desc_ptr, mbar_ptr, multicast_mask, smem_ptr, crd0, crd1, crd2, crd3, crd4); + return SM90_TMA_LOAD_MULTICAST_5D::copy(desc_ptr, mbar_ptr, multicast_mask, cache_hint, smem_ptr, crd0, crd1, crd2, crd3, crd4); } using PREFETCH = typename SM90_TMA_LOAD::PREFETCH; @@ -744,6 +760,7 @@ struct SM90_TMA_LOAD_IM2COL_MULTICAST_3D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); // Copy from global to shared::cluster. asm volatile ( "cp.async.bulk.tensor.3d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes.multicast::cluster" @@ -772,6 +789,7 @@ struct SM90_TMA_LOAD_IM2COL_MULTICAST_4D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); // Copy from global to shared::cluster. asm volatile ( "cp.async.bulk.tensor.4d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes.multicast::cluster" @@ -800,6 +818,7 @@ struct SM90_TMA_LOAD_IM2COL_MULTICAST_5D uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_mbar = cast_smem_ptr_to_uint(mbar_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_load(__LINE__, gmem_int_desc, smem_int_mbar, smem_int_ptr); // Copy from global to shared::cluster. 
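// [Editorial note, not part of the upstream patch.] The non-im2col multicast
// loads above gain a uint64_t cache_hint parameter, forwarded to the new
// ".L2::cache_hint" qualifier as an extra PTX operand; the im2col multicast
// variants in this hunk keep their original signature. A hypothetical call
// site (mask and hint values are illustrative only):
//
//   uint16_t mcast_mask = 0b0011;  // multicast to CTA ranks 0 and 1
//   uint64_t cache_hint = 0;       // 0 = no explicit L2 eviction policy
//   SM90_TMA_LOAD_MULTICAST_2D::copy(&desc, &mbar, mcast_mask, cache_hint,
//                                    smem_ptr, crd0, crd1);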
asm volatile ( "cp.async.bulk.tensor.5d.shared::cluster.global.im2col.mbarrier::complete_tx::bytes.multicast::cluster" @@ -871,6 +890,7 @@ struct SM90_TMA_STORE_1D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.1d.global.shared::cta.bulk_group [%0, {%2}], [%1];" : @@ -893,6 +913,7 @@ struct SM90_TMA_STORE_2D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.2d.global.shared::cta.bulk_group [%0, {%2, %3}], [%1];" : @@ -915,6 +936,7 @@ struct SM90_TMA_STORE_3D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.3d.global.shared::cta.bulk_group [%0, {%2, %3, %4}], [%1];" : @@ -937,6 +959,7 @@ struct SM90_TMA_STORE_4D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.4d.global.shared::cta.bulk_group [%0, {%2, %3, %4, %5}], [%1];" : @@ -959,6 +982,7 @@ struct SM90_TMA_STORE_5D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.5d.global.shared::cta.bulk_group [%0, {%2, %3, %4, %5, %6}], [%1];" : @@ -1024,6 +1048,7 @@ struct SM90_TMA_STORE_IM2COL_3D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.3d.global.shared::cta.im2col_no_offs.bulk_group" " [%0, {%2, %3, %4}], [%1];" @@ -1047,6 +1072,7 @@ struct SM90_TMA_STORE_IM2COL_4D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.4d.global.shared::cta.im2col_no_offs.bulk_group" " [%0, {%2, %3, %4, %5}], [%1];" @@ -1070,6 +1096,7 @@ struct SM90_TMA_STORE_IM2COL_5D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.async.bulk.tensor.5d.global.shared::cta.im2col_no_offs.bulk_group" " [%0, {%2, %3, %4, %5, %6}], [%1];" @@ -1112,6 +1139,7 @@ struct SM90_TMA_STORE_IM2COL CUTE_HOST_DEVICE static void tma_store_fence() { #if defined(CUTE_ARCH_TMA_SM90_ENABLED) + cutlass::arch::synclog_emit_fence_view_async_shared(__LINE__); asm volatile ("fence.proxy.async.shared::cta;"); #elif defined(__CUDA_ARCH__) CUTE_INVALID_CONTROL_PATH("Trying to use tma without 
CUTE_ARCH_TMA_SM90_ENABLED."); @@ -1122,6 +1150,7 @@ tma_store_fence() { CUTE_HOST_DEVICE static void tma_store_arrive() { #if defined(CUTE_ARCH_TMA_SM90_ENABLED) + cutlass::arch::synclog_emit_tma_store_arrive(__LINE__); asm volatile("cp.async.bulk.commit_group;"); #else CUTE_INVALID_CONTROL_PATH("Trying to use tma without CUTE_ARCH_TMA_SM90_ENABLED."); @@ -1138,6 +1167,7 @@ tma_store_wait() { : : "n"(Count) : "memory"); + cutlass::arch::synclog_emit_tma_store_wait(__LINE__, Count); #else CUTE_INVALID_CONTROL_PATH("Trying to use tma without CUTE_ARCH_TMA_SM90_ENABLED."); #endif @@ -1157,6 +1187,7 @@ struct SM90_TMA_REDUCE_ADD_1D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.reduce.async.bulk.tensor.1d.global.shared::cta.add.bulk_group [%0, {%2}], [%1];" : @@ -1179,6 +1210,7 @@ struct SM90_TMA_REDUCE_ADD_2D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.reduce.async.bulk.tensor.2d.global.shared::cta.add.bulk_group [%0, {%2, %3}], [%1];" : @@ -1201,6 +1233,7 @@ struct SM90_TMA_REDUCE_ADD_3D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.reduce.async.bulk.tensor.3d.global.shared::cta.add.bulk_group [%0, {%2, %3, %4}], [%1];" : @@ -1223,6 +1256,7 @@ struct SM90_TMA_REDUCE_ADD_4D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.reduce.async.bulk.tensor.4d.global.shared::cta.add.bulk_group [%0, {%2, %3, %4, %5}], [%1];" : @@ -1245,6 +1279,7 @@ struct SM90_TMA_REDUCE_ADD_5D #if defined(CUTE_ARCH_TMA_SM90_ENABLED) uint64_t gmem_int_desc = reinterpret_cast(desc_ptr); uint32_t smem_int_ptr = cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_tma_store(__LINE__, gmem_int_desc, smem_int_ptr); asm volatile ( "cp.reduce.async.bulk.tensor.5d.global.shared::cta.add.bulk_group [%0, {%2, %3, %4, %5, %6}], [%1];" : diff --git a/include/cute/arch/mma.hpp b/include/cute/arch/mma.hpp index 5bfda7463c..6e06114a6c 100644 --- a/include/cute/arch/mma.hpp +++ b/include/cute/arch/mma.hpp @@ -30,9 +30,9 @@ **************************************************************************************************/ #pragma once -#include - -#include +#include // CUTE_HOST_DEVICE +#include // cute::fma +#include // cute::fma namespace cute { diff --git a/include/cute/arch/mma_sm80.hpp b/include/cute/arch/mma_sm80.hpp index 5c552f61b9..cedea7c33d 100644 --- a/include/cute/arch/mma_sm80.hpp +++ b/include/cute/arch/mma_sm80.hpp @@ -2142,4 +2142,103 @@ struct SM80_16x8x256_S32U1U1S32_TN_XORPOPC //////////////////////////////////////////////////////////////////////////////////////////////////// +// MMA 8x8x128 TN +struct SM80_8x8x128_S32U1U1S32_TN_ANDPOPC +{ + using DRegisters = uint32_t[2]; + using ARegisters = uint32_t[1]; + using BRegisters = uint32_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + 
fma(uint32_t & d0, uint32_t & d1, + uint32_t const& a0, + uint32_t const& b0, + uint32_t const& c0, uint32_t const& c1) + { +#if defined(CUTE_ARCH_MMA_B1_AND_SM80_ENABLED) + asm volatile( + "mma.sync.aligned.m8n8k128.row.col.s32.b1.b1.s32.and.popc " + "{%0, %1}," + "{%2}," + "{%3}," + "{%4, %5};\n" + : "=r"(d0), "=r"(d1) + : "r"(a0), + "r"(b0), + "r"(c0), "r"(c1)); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM80_8x8x128_S32U1U1S32_TN_ANDPOPC without CUTE_ARCH_MMA_B1_AND_SM80_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// MMA 16x8x128 TN +struct SM80_16x8x128_S32U1U1S32_TN_ANDPOPC +{ + using DRegisters = uint32_t[4]; + using ARegisters = uint32_t[2]; + using BRegisters = uint32_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& a0, uint32_t const& a1, + uint32_t const& b0, + uint32_t const& c0, uint32_t const& c1, uint32_t const& c2, uint32_t const& c3) + { +#if defined(CUTE_ARCH_MMA_B1_AND_SM80_ENABLED) + asm volatile( + "mma.sync.aligned.m16n8k128.row.col.s32.b1.b1.s32.and.popc " + "{%0, %1, %2, %3}," + "{%4, %5}," + "{%6}," + "{%7, %8, %9, %10};\n" + : "=r"(d0), "=r"(d1), "=r"(d2), "=r"(d3) + : "r"(a0), "r"(a1), + "r"(b0), + "r"(c0), "r"(c1), "r"(c2), "r"(c3)); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM80_16x8x128_S32U1U1S32_TN_ANDPOPC without CUTE_ARCH_MMA_B1_AND_SM80_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// MMA 16x8x256 TN +struct SM80_16x8x256_S32U1U1S32_TN_ANDPOPC +{ + using DRegisters = uint32_t[4]; + using ARegisters = uint32_t[4]; + using BRegisters = uint32_t[2]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint32_t const& b0, uint32_t const& b1, + uint32_t const& c0, uint32_t const& c1, uint32_t const& c2, uint32_t const& c3) + { +#if defined(CUTE_ARCH_MMA_B1_AND_SM80_ENABLED) + asm volatile( + "mma.sync.aligned.m16n8k256.row.col.s32.b1.b1.s32.and.popc " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + "{%8, %9}," + "{%10, %11, %12, %13};\n" + : "=r"(d0), "=r"(d1), "=r"(d2), "=r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "r"(b0), "r"(b1), + "r"(c0), "r"(c1), "r"(c2), "r"(c3)); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM80_16x8x256_S32U1U1S32_TN_ANDPOPC without CUTE_ARCH_MMA_B1_AND_SM80_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + } // namespace cute diff --git a/include/cute/arch/mma_sm90.hpp b/include/cute/arch/mma_sm90.hpp index d504bf39df..51d34563c4 100644 --- a/include/cute/arch/mma_sm90.hpp +++ b/include/cute/arch/mma_sm90.hpp @@ -32,7 +32,6 @@ #pragma once #include - #include // Config @@ -45,10 +44,12 @@ namespace cute { +namespace SM90 { + //////////////////////////////////////////////////////////////////////////////////////////////////// // MMA 16x8x4 TN -struct SM90_16x8x4_F64F64F64F64_TN +struct MMA_16x8x4_F64F64F64F64_TN { using DRegisters = double[4]; using ARegisters = double[2]; @@ -73,7 +74,7 @@ struct SM90_16x8x4_F64F64F64F64_TN "d"(b0), "d"(c0), "d"(c1), "d"(c2), "d"(c3)); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_16x8x4_F64F64F64F64_TN without CUTE_ARCH_MMA_SM90_ENABLED"); +
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_16x8x4_F64F64F64F64_TN without CUTE_ARCH_MMA_SM90_ENABLED"); #endif } }; @@ -81,7 +82,7 @@ struct SM90_16x8x4_F64F64F64F64_TN //////////////////////////////////////////////////////////////////////////////////////////////////// // MMA 16x8x8 TN -struct SM90_16x8x8_F64F64F64F64_TN +struct MMA_16x8x8_F64F64F64F64_TN { using DRegisters = double[4]; using ARegisters = double[4]; @@ -106,7 +107,7 @@ struct SM90_16x8x8_F64F64F64F64_TN "d"(b0), "d"(b1), "d"(c0), "d"(c1), "d"(c2), "d"(c3)); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_16x8x8_F64F64F64F64_TN without CUTE_ARCH_MMA_SM90_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_16x8x8_F64F64F64F64_TN without CUTE_ARCH_MMA_SM90_ENABLED"); #endif } }; @@ -114,7 +115,7 @@ struct SM90_16x8x8_F64F64F64F64_TN //////////////////////////////////////////////////////////////////////////////////////////////////// // MMA 16x8x16 TN -struct SM90_16x8x16_F64F64F64F64_TN +struct MMA_16x8x16_F64F64F64F64_TN { using DRegisters = double[4]; using ARegisters = double[8]; @@ -141,7 +142,7 @@ struct SM90_16x8x16_F64F64F64F64_TN "d"(b0), "d"(b1), "d"(b2), "d"(b3), "d"(c0), "d"(c1), "d"(c2), "d"(c3)); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_16x8x16_F64F64F64F64_TN without CUTE_ARCH_MMA_SM90_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_16x8x16_F64F64F64F64_TN without CUTE_ARCH_MMA_SM90_ENABLED"); #endif } }; @@ -149,7 +150,7 @@ struct SM90_16x8x16_F64F64F64F64_TN //////////////////////////////////////////////////////////////////////////////////////////////////// // MMA 16x8x4 TN -struct SM90_16x8x4_C64C64C64C64_TN +struct MMA_16x8x4_C64C64C64C64_TN { using DRegisters = complex[4]; using ARegisters = complex[2]; @@ -175,28 +176,28 @@ struct SM90_16x8x4_C64C64C64C64_TN double& id3 = reinterpret_cast(d3)[1]; // d.real() = a.real() * b.real() + c.real(); - SM90_16x8x4_F64F64F64F64_TN::fma( + MMA_16x8x4_F64F64F64F64_TN::fma( rd0, rd1, rd2, rd3, a0.real(), a1.real(), b0.real(), c0.real(), c1.real(), c2.real(), c3.real()); // d.imag() = a.imag() * b.real() + c.imag(); - SM90_16x8x4_F64F64F64F64_TN::fma( + MMA_16x8x4_F64F64F64F64_TN::fma( id0, id1, id2, id3, a0.imag(), a1.imag(), b0.real(), c0.imag(), c1.imag(), c2.imag(), c3.imag()); // d.real() = -a.imag() * b.imag() + d.real(); - SM90_16x8x4_F64F64F64F64_TN::fma( + MMA_16x8x4_F64F64F64F64_TN::fma( rd0, rd1, rd2, rd3, -a0.imag(), -a1.imag(), b0.imag(), d0.real(), d1.real(), d2.real(), d3.real()); // d.imag() = a.real() * b.imag() + d.imag(); - SM90_16x8x4_F64F64F64F64_TN::fma( + MMA_16x8x4_F64F64F64F64_TN::fma( id0, id1, id2, id3, a0.real(), a1.real(), b0.imag(), @@ -207,7 +208,7 @@ struct SM90_16x8x4_C64C64C64C64_TN //////////////////////////////////////////////////////////////////////////////////////////////////// // MMA 16x8x8 TN -struct SM90_16x8x8_C64C64C64C64_TN +struct MMA_16x8x8_C64C64C64C64_TN { using DRegisters = complex[4]; using ARegisters = complex[4]; @@ -234,28 +235,28 @@ struct SM90_16x8x8_C64C64C64C64_TN double& id3 = reinterpret_cast(d3)[1]; // d.real() = a.real() * b.real() + c.real(); - SM90_16x8x8_F64F64F64F64_TN::fma( + MMA_16x8x8_F64F64F64F64_TN::fma( rd0, rd1, rd2, rd3, a0.real(), a1.real(), a2.real(), a3.real(), b0.real(), b1.real(), c0.real(), c1.real(), c2.real(), c3.real()); // d.imag() = a.imag() * b.real() + c.imag(); - SM90_16x8x8_F64F64F64F64_TN::fma( + MMA_16x8x8_F64F64F64F64_TN::fma( id0, id1, id2, id3, a0.imag(), a1.imag(), a2.imag(), a3.imag(), b0.real(), b1.real(), c0.imag(), 
c1.imag(), c2.imag(), c3.imag()); // d.real() = -a.imag() * b.imag() + d.real(); - SM90_16x8x8_F64F64F64F64_TN::fma( + MMA_16x8x8_F64F64F64F64_TN::fma( rd0, rd1, rd2, rd3, -a0.imag(), -a1.imag(), -a2.imag(), -a3.imag(), b0.imag(), b1.imag(), d0.real(), d1.real(), d2.real(), d3.real()); // d.imag() = a.real() * b.imag() + d.imag(); - SM90_16x8x8_F64F64F64F64_TN::fma( + MMA_16x8x8_F64F64F64F64_TN::fma( id0, id1, id2, id3, a0.real(), a1.real(), a2.real(), a3.real(), b0.imag(), b1.imag(), @@ -266,7 +267,7 @@ struct SM90_16x8x8_C64C64C64C64_TN //////////////////////////////////////////////////////////////////////////////////////////////////// // MMA 16x8x16 TN -struct SM90_16x8x16_C64C64C64C64_TN +struct MMA_16x8x16_C64C64C64C64_TN { using DRegisters = complex[4]; using ARegisters = complex[8]; @@ -296,7 +297,7 @@ struct SM90_16x8x16_C64C64C64C64_TN double& id3 = reinterpret_cast(d3)[1]; // d.real() = a.real() * b.real() + c.real(); - SM90_16x8x16_F64F64F64F64_TN::fma( + MMA_16x8x16_F64F64F64F64_TN::fma( rd0, rd1, rd2, rd3, a0.real(), a1.real(), a2.real(), a3.real(), a4.real(), a5.real(), a6.real(), a7.real(), @@ -304,7 +305,7 @@ struct SM90_16x8x16_C64C64C64C64_TN c0.real(), c1.real(), c2.real(), c3.real()); // d.imag() = a.imag() * b.real() + c.imag(); - SM90_16x8x16_F64F64F64F64_TN::fma( + MMA_16x8x16_F64F64F64F64_TN::fma( id0, id1, id2, id3, a0.imag(), a1.imag(), a2.imag(), a3.imag(), a4.imag(), a5.imag(), a6.imag(), a7.imag(), @@ -312,7 +313,7 @@ struct SM90_16x8x16_C64C64C64C64_TN c0.imag(), c1.imag(), c2.imag(), c3.imag()); // d.real() = -a.imag() * b.imag() + d.real(); - SM90_16x8x16_F64F64F64F64_TN::fma( + MMA_16x8x16_F64F64F64F64_TN::fma( rd0, rd1, rd2, rd3, -a0.imag(), -a1.imag(), -a2.imag(), -a3.imag(), -a4.imag(), -a5.imag(), -a6.imag(), -a7.imag(), @@ -320,7 +321,7 @@ struct SM90_16x8x16_C64C64C64C64_TN d0.real(), d1.real(), d2.real(), d3.real()); // d.imag() = a.real() * b.imag() + d.imag(); - SM90_16x8x16_F64F64F64F64_TN::fma( + MMA_16x8x16_F64F64F64F64_TN::fma( id0, id1, id2, id3, a0.real(), a1.real(), a2.real(), a3.real(), a4.real(), a5.real(), a6.real(), a7.real(), @@ -331,17 +332,24 @@ struct SM90_16x8x16_C64C64C64C64_TN //////////////////////////////////////////////////////////////////////////////////////////////////// +} + } // namespace cute //////////////////////////////////////////////////////////////////////////////////////////////////// #include #include +#include +#include // cute::size +#include // cute::is_static +#include // cute::half_t, cute::float_e4m3_t, cute::tfloat32_t, etc +#include // cute::is_same_v //////////////////////////////////////////////////////////////////////////////////////////////////// namespace cute { -namespace GMMA { +namespace SM90::GMMA { template < class ElementA, @@ -370,73 +378,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x256x16_F16F16F16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x16_F16F16F16_SS{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x240x16_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x16_F16F16F16_SS{}; } #endif #if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x224x16_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x16_F16F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x208x16_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x16_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x192x16_F16F16F16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x16_F16F16F16_SS{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x176x16_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x16_F16F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x160x16_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x16_F16F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x144x16_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x16_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x128x16_F16F16F16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x16_F16F16F16_SS{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x112x16_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x16_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x96x16_F16F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x16_F16F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x80x16_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x16_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x64x16_F16F16F16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return 
SM90::GMMA::MMA_64x56x16_F16F16F16_SS{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x48x16_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x16_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x32x16_F16F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x16_F16F16F16_SS{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x16x16_F16F16F16_SS{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x16_F16F16F16_SS{}; + return SM90::GMMA::MMA_64x8x16_F16F16F16_SS{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -450,73 +533,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_F16E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F16E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F16E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F16E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_F16E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F16E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F16E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F16E4M3E4M3_SS_TN{}; } 
#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_F16E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F16E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_F16E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F16E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_F16E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F16E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_F16E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F16E4M3E4M3_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_F16E4M3E4M3_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F16E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_F16E4M3E4M3_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -530,73 +688,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_F16E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F16E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return 
SM90_64x240x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F16E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F16E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_F16E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F16E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F16E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F16E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_F16E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F16E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_F16E4M3E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F16E4M3E5M2_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_F16E4M3E5M2_SS_TN{}; + } +#endif 
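// [Editorial note, not part of the upstream patch.] ss_op_selector() probes
// atom shapes in descending N and returns the first GMMA whose N divides
// Tile_N; the extended-shape branches make every multiple of 8 up to 256 a
// candidate. For example, Tile_N == 80 now resolves to
// MMA_64x80x32_F16E4M3E5M2_SS_TN, whereas without
// CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED it falls through to
// MMA_64x16x32_F16E4M3E5M2_SS_TN (80 = 5 * 16).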
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_F16E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F16E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_F16E4M3E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F16E4M3E5M2_SS_TN{}; + } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_F16E4M3E5M2_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F16E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_F16E4M3E5M2_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -610,73 +843,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_F16E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F16E5M2E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F16E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F16E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_F16E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F16E5M2E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if 
constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F16E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F16E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_F16E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F16E5M2E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_F16E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F16E5M2E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_F16E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F16E5M2E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_F16E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F16E5M2E4M3_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_F16E5M2E4M3_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F16E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_F16E5M2E4M3_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -690,73 +998,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, 
"Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_F16E5M2E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F16E5M2E5M2_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F16E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F16E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F16E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_F16E5M2E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F16E5M2E5M2_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F16E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F16E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F16E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_F16E5M2E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F16E5M2E5M2_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F16E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return 
SM90_64x96x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_F16E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F16E5M2E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F16E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_F16E5M2E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F16E5M2E5M2_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F16E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_F16E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F16E5M2E5M2_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_F16E5M2E5M2_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F16E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_F16E5M2E5M2_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -776,73 +1159,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x256x16_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x16_F32F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x240x16_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x16_F32F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x224x16_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x16_F32F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x208x16_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x16_F32F16F16_SS{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x192x16_F32F16F16_SS{}; + } +#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x16_F32F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x176x16_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x16_F32F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x160x16_F32F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x16_F32F16F16_SS{}; + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x16_F32F16F16_SS{}; } #endif - else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x16_F32F16F16_SS{}; +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x16_F32F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x16_F32F16F16_SS{}; + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x16_F32F16F16_SS{}; } #endif - else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x16_F32F16F16_SS{}; + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x16_F32F16F16_SS{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x16_F32F16F16_SS{}; + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x16_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x16_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x16_F32F16F16_SS{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x16_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x16_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x16_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x16_F32F16F16_SS{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x64x16_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x16_F32F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x48x16_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x16_F32F16F16_SS{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x32x16_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return 
SM90::GMMA::MMA_64x24x16_F32F16F16_SS{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x16x16_F32F16F16_SS{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x16_F32F16F16_SS{}; + return SM90::GMMA::MMA_64x8x16_F32F16F16_SS{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -854,73 +1312,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x256x16_F32BF16BF16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x16_F32BF16BF16_SS{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x240x16_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x16_F32BF16BF16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x224x16_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x16_F32BF16BF16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x208x16_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x16_F32BF16BF16_SS{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x192x16_F32BF16BF16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x16_F32BF16BF16_SS{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x176x16_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x16_F32BF16BF16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x160x16_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x16_F32BF16BF16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x144x16_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x16_F32BF16BF16_SS{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x128x16_F32BF16BF16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x16_F32BF16BF16_SS{}; + } +#endif #if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x112x16_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x16_F32BF16BF16_SS{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x96x16_F32BF16BF16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x16_F32BF16BF16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x80x16_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x16_F32BF16BF16_SS{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x64x16_F32BF16BF16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x16_F32BF16BF16_SS{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x48x16_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x16_F32BF16BF16_SS{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x32x16_F32BF16BF16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x16_F32BF16BF16_SS{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x16x16_F32BF16BF16_SS{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x16_F32BF16BF16_SS{}; + return SM90::GMMA::MMA_64x8x16_F32BF16BF16_SS{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -934,73 +1467,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 8 == 0, "Tile_K must be a multiple of 8."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x256x8_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x8_F32TF32TF32_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x240x8_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x8_F32TF32TF32_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x224x8_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x8_F32TF32TF32_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x8_F32TF32TF32_SS_TN{}; + return 
SM90::GMMA::MMA_64x208x8_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x8_F32TF32TF32_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x192x8_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x8_F32TF32TF32_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x176x8_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x8_F32TF32TF32_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x160x8_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x8_F32TF32TF32_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x144x8_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x8_F32TF32TF32_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x128x8_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x8_F32TF32TF32_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x112x8_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x8_F32TF32TF32_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x96x8_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x8_F32TF32TF32_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x80x8_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x8_F32TF32TF32_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x64x8_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x8_F32TF32TF32_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x48x8_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x8_F32TF32TF32_SS_TN{}; 
} #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x32x8_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x8_F32TF32TF32_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x16x8_F32TF32TF32_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x8_F32TF32TF32_SS_TN{}; + return SM90::GMMA::MMA_64x8x8_F32TF32TF32_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1014,73 +1622,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F32E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F32E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F32E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F32E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F32E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F32E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F32E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return 
SM90::GMMA::MMA_64x136x32_F32E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F32E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F32E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_F32E4M3E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F32E4M3E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F32E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F32E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F32E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_F32E4M3E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F32E4M3E4M3_SS_TN{}; + } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_F32E4M3E4M3_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F32E4M3E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_F32E4M3E4M3_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1094,73 +1777,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F32E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F32E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr 
(Tile_N % 224 == 0) { - return SM90_64x224x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F32E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F32E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F32E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F32E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F32E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F32E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F32E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F32E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F32E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F32E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_F32E4M3E5M2_SS_TN{}; + } +#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F32E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F32E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F32E4M3E5M2_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_F32E4M3E5M2_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F32E4M3E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_F32E4M3E5M2_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1174,73 +1932,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_F32E5M2E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F32E5M2E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F32E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F32E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F32E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_F32E5M2E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F32E5M2E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F32E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_F32E5M2E4M3_SS_TN{}; + } +#endif 
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F32E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F32E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_F32E5M2E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F32E5M2E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F32E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_F32E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F32E5M2E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F32E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_F32E5M2E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F32E5M2E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F32E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_F32E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F32E5M2E4M3_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_F32E5M2E4M3_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F32E5M2E4M3_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_F32E5M2E4M3_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1254,73 +2087,148 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_F32E5M2E5M2_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 240 == 0) { - return 
SM90_64x240x32_F32E5M2E5M2_SS_TN{}; + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F32E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F32E5M2E5M2_SS_TN{}; + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F32E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F32E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F32E5M2E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F32E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F32E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F32E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F32E5M2E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F32E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return 
SM90::GMMA::MMA_64x88x32_F32E5M2E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F32E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F32E5M2E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F32E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F32E5M2E5M2_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_F32E5M2E5M2_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F32E5M2E5M2_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_F32E5M2E5M2_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1342,73 +2250,78 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_S32S8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_S32S8S8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_S32S8S8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_S32S8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_S32S8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_S32S8S8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_S32S8S8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_S32S8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_S32S8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_S32S8S8_SS_TN{}; + return 
SM90::GMMA::MMA_64x112x32_S32S8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_S32S8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_S32S8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_S32S8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_S32S8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_S32S8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_S32S8S8_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_S32S8S8_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_S32S8S8_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_S32S8S8_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1422,73 +2335,78 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_S32S8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_S32S8U8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_S32S8U8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_S32S8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_S32S8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_S32S8U8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_S32S8U8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_S32S8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_S32S8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_S32S8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_S32S8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_S32S8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 
0) { - return SM90_64x64x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_S32S8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_S32S8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_S32S8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_S32S8U8_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_S32S8U8_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_S32S8U8_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_S32S8U8_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1502,73 +2420,78 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_S32U8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_S32U8S8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_S32U8S8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_S32U8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_S32U8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_S32U8S8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_S32U8S8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_S32U8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_S32U8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_S32U8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_S32U8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_S32U8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_S32U8S8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_S32U8S8_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_S32U8S8_SS_TN{}; + return 
SM90::GMMA::MMA_64x32x32_S32U8S8_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_S32U8S8_SS_TN{}; + } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_S32U8S8_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_S32U8S8_SS_TN{}; + return SM90::GMMA::MMA_64x8x32_S32U8S8_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1582,73 +2505,78 @@ ss_op_selector() static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x256x32_S32U8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x240x32_S32U8U8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x224x32_S32U8U8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x208x32_S32U8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x192x32_S32U8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x176x32_S32U8U8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x160x32_S32U8U8_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x144x32_S32U8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x128x32_S32U8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x112x32_S32U8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x96x32_S32U8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x80x32_S32U8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x64x32_S32U8U8_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x48x32_S32U8U8_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x32x32_S32U8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_S32U8U8_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_S32U8U8_SS_TN{}; + return SM90::GMMA::MMA_64x16x32_S32U8U8_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_S32U8U8_SS_TN{}; 
+ return SM90::GMMA::MMA_64x8x32_S32U8U8_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1678,12 +2606,11 @@ template < > CUTE_HOST_DEVICE constexpr auto -rs_op_selector() +ss_op_selector_sparse() { static_assert(is_static::value, "TileShape_MNK must be static."); static_assert(rank(TileShape_MNK{}) == 3, "TileShape_MNK must be rank 3."); static_assert(size<0>(TileShape_MNK{}) % 64 == 0, "Tile_M must be a multiple of 64."); - static_assert(MajorA == GMMA::Major::K, "Register source A operand GMMAs must have K-major A layout."); auto Tile_N = size<1>(TileShape_MNK{}); // F16 accumulator @@ -1691,76 +2618,151 @@ rs_op_selector() // Input A: half_t ; Input B: half_t if constexpr (is_same_v && is_same_v) { - static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x256x32_F16F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x32_F16F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x240x32_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x32_F16F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x224x32_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x32_F16F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x208x32_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x32_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x192x32_F16F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x32_F16F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x176x32_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x32_F16F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x160x32_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x32_F16F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x16_F16F16F16_RS{}; + return 
SM90::GMMA::SPARSE::GMMA_64x144x32_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x32_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x128x32_F16F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x32_F16F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x112x32_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x32_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x96x32_F16F16F16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x32_F16F16F16_SS{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x80x32_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x32_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x64x32_F16F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x32_F16F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x48x32_F16F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x32_F16F16F16_SS{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x32x32_F16F16F16_SS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x32_F16F16F16_SS{}; + } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x16x32_F16F16F16_SS{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x16_F16F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x8x32_F16F16F16_SS{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1771,76 +2773,151 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_F16E4M3E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N 
% 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_F16E4M3E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_F16E4M3E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F16E4M3E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_F16E4M3E4M3_SS_TN{}; + } 
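// ---------------------------------------------------------------------------
// Illustrative sketch, not part of this diff: every hunk in this series makes
// the same two changes. The flat SM90_* atom names move into the SM90::GMMA::
// namespace (and, for the new ss_op_selector_sparse(), SM90::GMMA::SPARSE::),
// and additional N extents are added behind CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED
// (e.g. N % 24 on the dense integer paths, and every multiple of 8 up to 256
// on the sparse paths). The dispatch itself is unchanged: try candidate N
// extents from largest to smallest and take the first whose N divides Tile_N.
// The reduced stand-alone version below uses hypothetical names
// (SparseGmmaAtom, pick_sparse_atom) and only a handful of extents.
template <int N> struct SparseGmmaAtom { static constexpr int n = N; };

template <int TileN>
constexpr auto pick_sparse_atom() {
  if constexpr (TileN % 256 == 0) { return SparseGmmaAtom<256>{}; }
#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
  else if constexpr (TileN % 248 == 0) { return SparseGmmaAtom<248>{}; }
#endif
  else if constexpr (TileN % 192 == 0) { return SparseGmmaAtom<192>{}; }
  else if constexpr (TileN % 128 == 0) { return SparseGmmaAtom<128>{}; }
  else if constexpr (TileN % 64  == 0) { return SparseGmmaAtom<64>{}; }
  else if constexpr (TileN % 32  == 0) { return SparseGmmaAtom<32>{}; }
  else if constexpr (TileN % 16  == 0) { return SparseGmmaAtom<16>{}; }
  else if constexpr (TileN % 8   == 0) { return SparseGmmaAtom<8>{}; }
  else { static_assert(TileN % 8 == 0, "Tile_N must be a multiple of 8."); }
}

// 96 = 3 * 32: the largest extent in this reduced chain that divides 96 is 32,
// though the full selector above would pick its dedicated N = 96 atom.
static_assert(decltype(pick_sparse_atom<96>())::n == 32);
static_assert(decltype(pick_sparse_atom<128>())::n == 128);
// ---------------------------------------------------------------------------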
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F16E4M3E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_F16E4M3E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F16E4M3E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_F16E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F16E4M3E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_F16E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F16E4M3E4M3_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_F16E4M3E4M3_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F16E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E4M3_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1851,76 +2928,151 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_F16E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return 
SM90_64x208x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_F16E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_F16E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F16E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_F16E4M3E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F16E4M3E5M2_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_F16E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F16E4M3E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - 
return SM90_64x48x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_F16E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F16E4M3E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_F16E4M3E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F16E4M3E5M2_SS_TN{}; + } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_F16E4M3E5M2_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F16E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E5M2_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -1931,76 +3083,151 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_F16E5M2E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_F16E5M2E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E4M3_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_F16E5M2E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F16E5M2E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_F16E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E4M3_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_F16E5M2E4M3_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E4M3_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E4M3_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_F16E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E4M3_SS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_F16E5M2E4M3_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F16E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_F16E5M2E4M3_SS_TN{}; } else { 
static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2011,76 +3238,151 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_F16E5M2E5M2_SS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F16E5M2E5M2_RS_TN{}; + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F16E5M2E5M2_RS_TN{}; + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F16E5M2E5M2_RS_TN{}; + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E5M2_SS_TN{}; } #endif - else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F16E5M2E5M2_RS_TN{}; - } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F16E5M2E5M2_RS_TN{}; + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F16E5M2E5M2_RS_TN{}; + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E5M2_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) - else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F16E5M2E5M2_RS_TN{}; + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_F16E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F16E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_F16E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F16E5M2E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_F16E5M2E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E5M2_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_F16E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E5M2_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E5M2_SS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_F16E5M2E5M2_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E5M2_SS_TN{}; + } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_F16E5M2E5M2_SS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F16E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_F16E5M2E5M2_SS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2097,76 +3399,151 @@ rs_op_selector() // Input A: half_t ; Input B: half_t if constexpr (is_same_v && is_same_v) { - static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x256x32_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x248x32_F32F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x240x32_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x32_F32F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x224x32_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x32_F32F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x208x32_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x32_F32F16F16_SS{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x192x32_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x32_F32F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x176x32_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x32_F32F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x160x32_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x32_F32F16F16_SS{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x144x32_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x32_F32F16F16_SS{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x128x32_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x32_F32F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x112x32_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x32_F32F16F16_SS{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x96x32_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x32_F32F16F16_SS{}; } +#endif #if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x80x32_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x32_F32F16F16_SS{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x64x32_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x32_F32F16F16_SS{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x48x32_F32F16F16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x32_F32F16F16_SS{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x32x32_F32F16F16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x32_F32F16F16_SS{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x16x32_F32F16F16_SS{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x16_F32F16F16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x8x32_F32F16F16_SS{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2175,76 +3552,4639 @@ rs_op_selector() // Input A: bfloat16_t ; Input B: bfloat16_t else if constexpr (is_same_v && is_same_v) { + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x32_F32BF16BF16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x32_F32BF16BF16_SS{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x32_F32BF16BF16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x176x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x32_F32BF16BF16_SS{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x32_F32BF16BF16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x32_F32BF16BF16_SS{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x32_F32BF16BF16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x32_F32BF16BF16_SS{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x32_F32BF16BF16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x32_F32BF16BF16_SS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x32_F32BF16BF16_SS{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x32_F32BF16BF16_SS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x32_F32BF16BF16_SS{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x32_F32BF16BF16_SS{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x32_F32BF16BF16_SS{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: tfloat32_t ; Input B: tfloat32_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K 
for this config."); static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x256x16_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x16_F32TF32TF32_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x240x16_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x16_F32TF32TF32_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x224x16_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x16_F32TF32TF32_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x208x16_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x16_F32TF32TF32_SS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x192x16_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x16_F32TF32TF32_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x176x16_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x16_F32TF32TF32_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x160x16_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x16_F32TF32TF32_SS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x144x16_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x16_F32TF32TF32_SS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x128x16_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x16_F32TF32TF32_SS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x112x16_F32TF32TF32_SS_TN{}; + } +#endif +#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x16_F32TF32TF32_SS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x96x16_F32TF32TF32_SS_TN{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x16_F32TF32TF32_SS_TN{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x80x16_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x16_F32TF32TF32_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x16_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x16_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x16_F32TF32TF32_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x16_F32TF32TF32_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x16_F32TF32TF32_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x16_F32TF32TF32_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x16_F32TF32TF32_SS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x16_F32TF32TF32_SS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e4m3_t ; Input B: float_e4m3_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E4M3_SS_TN{}; + } 
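// ---------------------------------------------------------------------------
// Illustrative sketch, not part of this diff: across these sparse branches the
// Tile_K constraint is exactly double the dense one -- Tile_K % 32 for 16-bit
// inputs (half_t, bfloat16_t), % 64 for the 8-bit e4m3/e5m2 FP8 types, and
// % 16 for tfloat32_t. One way to see the pattern: a dense SM90 GMMA consumes
// 256 bits of each A row along K, and 2:4 structured sparsity lets a single
// instruction span twice that logical K depth. The helpers below are
// hypothetical restatements of the static_asserts in these hunks, not CUTLASS
// definitions.
constexpr int dense_gmma_k(int element_bits)  { return 256 / element_bits; }
constexpr int sparse_gmma_k(int element_bits) { return 2 * dense_gmma_k(element_bits); }

static_assert(sparse_gmma_k(16) == 32);  // half_t / bfloat16_t -> Tile_K % 32 == 0
static_assert(sparse_gmma_k(8)  == 64);  // e4m3 / e5m2 (FP8)   -> Tile_K % 64 == 0
static_assert(sparse_gmma_k(32) == 16);  // tfloat32_t          -> Tile_K % 16 == 0
// ---------------------------------------------------------------------------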
+#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F32E4M3E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F32E4M3E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_F32E4M3E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F32E4M3E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_F32E4M3E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x24x64_F32E4M3E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_F32E4M3E4M3_SS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_F32E4M3E4M3_SS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e4m3_t ; Input B: float_e5m2_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 
== 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F32E4M3E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F32E4M3E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_F32E4M3E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F32E4M3E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_F32E4M3E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F32E4M3E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_F32E4M3E5M2_SS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_F32E4M3E5M2_SS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e5m2_t ; Input B: float_e4m3_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_F32E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_F32E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_F32E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_F32E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_F32E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E4M3_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_F32E5M2E4M3_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E4M3_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_F32E5M2E4M3_SS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_F32E5M2E4M3_SS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e5m2_t ; Input B: float_e5m2_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 
0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E5M2_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_F32E5M2E5M2_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E5M2_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_F32E5M2E5M2_SS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_F32E5M2E5M2_SS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + else { + static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); + } + } + + // S32 accumulator + else if constexpr (is_same_v) { + + // Input A: int8_t ; Input B: int8_t + if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8S8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8S8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if 
constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8S8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8S8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8S8_SS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8S8_SS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: int8_t ; Input B: uint8_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + 
else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_SS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_SS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: uint8_t ; Input B: int8_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8S8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8S8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_SS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_SS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: uint8_t ; Input B: uint8_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8U8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8U8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8U8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8U8_SS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8U8_SS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_SS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8U8_SS_TN{}; + } 
+#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_SS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_SS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + else { + static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); + } + } + + // Unknown accumulator type + else { + static_assert(sizeof(ElementC) == 0, "Unknown ElementC accumulator type."); + } +} + +template < + class ElementA, + class ElementB, + class ElementC, + class TileShape_MNK, + GMMA::Major MajorA = GMMA::Major::K, + GMMA::Major MajorB = GMMA::Major::K, + auto... Args // e.g. GMMA::ScaleOut::One, [GMMA::ScaleIn::One, GMMA::ScaleIn::One] + // But most commonly leave empty for defaults +> +CUTE_HOST_DEVICE constexpr +auto +rs_op_selector() +{ + static_assert(is_static::value, "TileShape_MNK must be static."); + static_assert(rank(TileShape_MNK{}) == 3, "TileShape_MNK must be rank 3."); + static_assert(size<0>(TileShape_MNK{}) % 64 == 0, "Tile_M must be a multiple of 64."); + static_assert(MajorA == GMMA::Major::K, "Register source A operand GMMAs must have K-major A layout."); + auto Tile_N = size<1>(TileShape_MNK{}); + + // F16 accumulator + if constexpr (is_same_v) { + + // Input A: half_t ; Input B: half_t + if constexpr (is_same_v && is_same_v) { + static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x16_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x16_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x16_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return 
SM90::GMMA::MMA_64x152x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x16_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x16_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x16_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x16_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x16_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x16_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x16_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x16_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x16_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x16_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x16_F16F16F16_RS{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x16_F16F16F16_RS{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e4m3_t ; Input B: float_e4m3_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 
0) { + return SM90::GMMA::MMA_64x40x32_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_F16E4M3E4M3_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_F16E4M3E4M3_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e4m3_t ; Input B: float_e5m2_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return 
SM90::GMMA::MMA_64x128x32_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_F16E4M3E5M2_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_F16E4M3E5M2_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e5m2_t ; Input B: float_e4m3_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F16E5M2E4M3_RS_TN{}; + } 
+#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_F16E5M2E4M3_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_F16E5M2E4M3_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e5m2_t ; Input B: float_e5m2_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_F16E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F16E5M2E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_F16E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F16E5M2E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_F16E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_F16E5M2E5M2_RS_TN{}; + } 
+#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F16E5M2E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_F16E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F16E5M2E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_F16E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F16E5M2E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_F16E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F16E5M2E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_F16E5M2E5M2_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_F16E5M2E5M2_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + else { + static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); + } + } + + // F32 accumulator + else if constexpr (is_same_v) { + + // Input A: half_t ; Input B: half_t + if constexpr (is_same_v && is_same_v) { + static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x16_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x16_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x16_F32F16F16_RS{}; + } +#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x16_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x16_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x16_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x16_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x16_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x16_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x16_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x16_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x16_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x16_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x16_F32F16F16_RS{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x16_F32F16F16_RS{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: bfloat16_t ; Input B: bfloat16_t + else if constexpr (is_same_v && is_same_v) { + static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); + + if constexpr (Tile_N % 256 == 0) { + return 
SM90::GMMA::MMA_64x256x16_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x16_F32BF16BF16_RS{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x16_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x16_F32BF16BF16_RS{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x16_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x16_F32BF16BF16_RS{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x16_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x16_F32BF16BF16_RS{}; + } +#endif + else if constexpr (Tile_N % 64 == 
0) { + return SM90::GMMA::MMA_64x64x16_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x16_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x16_F32BF16BF16_RS{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x16_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x16_F32BF16BF16_RS{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x16_F32BF16BF16_RS{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x16_F32BF16BF16_RS{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: tfloat32_t ; Input B: tfloat32_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 8 == 0, "Tile_K must be a multiple of 8."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x8_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x8_F32TF32TF32_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x8_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x8_F32TF32TF32_RS_TN{}; + 
} +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x8_F32TF32TF32_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x8_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x8_F32TF32TF32_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x8_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x8_F32TF32TF32_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x8_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x8_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x8_F32TF32TF32_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x8_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x8_F32TF32TF32_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x8_F32TF32TF32_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x8_F32TF32TF32_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e4m3_t ; Input B: float_e4m3_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return 
SM90::GMMA::MMA_64x232x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F32E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F32E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F32E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F32E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_F32E4M3E4M3_RS_TN{}; + } +#endif +#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F32E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F32E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_F32E4M3E4M3_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_F32E4M3E4M3_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e4m3_t ; Input B: float_e5m2_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F32E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F32E4M3E5M2_RS_TN{}; + } +#endif + else if 
constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F32E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F32E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F32E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_F32E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_F32E4M3E5M2_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_F32E4M3E5M2_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e5m2_t ; Input B: float_e4m3_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F32E5M2E4M3_RS_TN{}; 
+ } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F32E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F32E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::MMA_64x104x32_F32E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::MMA_64x88x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::MMA_64x72x32_F32E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::MMA_64x56x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::MMA_64x40x32_F32E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return 
SM90::GMMA::MMA_64x24x32_F32E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_F32E5M2E4M3_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_F32E5M2E4M3_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e5m2_t ; Input B: float_e5m2_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_F32E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::MMA_64x248x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::MMA_64x232x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::MMA_64x216x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::MMA_64x200x32_F32E5M2E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_F32E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::MMA_64x184x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::MMA_64x168x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::MMA_64x152x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::MMA_64x136x32_F32E5M2E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_F32E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::MMA_64x120x32_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return 
SM90::GMMA::MMA_64x112x32_F32E5M2E5M2_RS_TN{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 104 == 0) {
+        return SM90::GMMA::MMA_64x104x32_F32E5M2E5M2_RS_TN{};
+      }
+#endif
+      else if constexpr (Tile_N % 96 == 0) {
+        return SM90::GMMA::MMA_64x96x32_F32E5M2E5M2_RS_TN{};
+      }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 88 == 0) {
+        return SM90::GMMA::MMA_64x88x32_F32E5M2E5M2_RS_TN{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 80 == 0) {
+        return SM90::GMMA::MMA_64x80x32_F32E5M2E5M2_RS_TN{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 72 == 0) {
+        return SM90::GMMA::MMA_64x72x32_F32E5M2E5M2_RS_TN{};
+      }
+#endif
+      else if constexpr (Tile_N % 64 == 0) {
+        return SM90::GMMA::MMA_64x64x32_F32E5M2E5M2_RS_TN{};
+      }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 56 == 0) {
+        return SM90::GMMA::MMA_64x56x32_F32E5M2E5M2_RS_TN{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 48 == 0) {
+        return SM90::GMMA::MMA_64x48x32_F32E5M2E5M2_RS_TN{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 40 == 0) {
+        return SM90::GMMA::MMA_64x40x32_F32E5M2E5M2_RS_TN{};
+      }
+#endif
+      else if constexpr (Tile_N % 32 == 0) {
+        return SM90::GMMA::MMA_64x32x32_F32E5M2E5M2_RS_TN{};
+      }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 24 == 0) {
+        return SM90::GMMA::MMA_64x24x32_F32E5M2E5M2_RS_TN{};
+      }
+#endif
+      else if constexpr (Tile_N % 16 == 0) {
+        return SM90::GMMA::MMA_64x16x32_F32E5M2E5M2_RS_TN{};
+      }
+      else if constexpr (Tile_N % 8 == 0) {
+        return SM90::GMMA::MMA_64x8x32_F32E5M2E5M2_RS_TN{};
+      }
+      else {
+        static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8.");
+      }
+    }
+
+    else {
+      static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration.");
+    }
+  }
+
+  // S32 accumulator
+  else if constexpr (is_same_v<ElementC, int32_t>) {
+
+    // Input A: int8_t ; Input B: int8_t
+    if constexpr (is_same_v<ElementA, int8_t> && is_same_v<ElementB, int8_t>) {
+      static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config.");
+      static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config.");
+      static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32.");
+
+      if constexpr (Tile_N % 256 == 0) {
+        return SM90::GMMA::MMA_64x256x32_S32S8S8_RS_TN{};
+      }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 240 == 0) {
+        return SM90::GMMA::MMA_64x240x32_S32S8S8_RS_TN{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 224 == 0) {
+        return SM90::GMMA::MMA_64x224x32_S32S8S8_RS_TN{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 208 == 0) {
+        return SM90::GMMA::MMA_64x208x32_S32S8S8_RS_TN{};
+      }
+#endif
+      else if constexpr (Tile_N % 192 == 0) {
+        return SM90::GMMA::MMA_64x192x32_S32S8S8_RS_TN{};
+      }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 176 == 0) {
+        return SM90::GMMA::MMA_64x176x32_S32S8S8_RS_TN{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 160 == 0) {
+        return SM90::GMMA::MMA_64x160x32_S32S8S8_RS_TN{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 144 == 0) {
+        return
SM90::GMMA::MMA_64x144x32_S32S8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_S32S8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_S32S8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_S32S8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_S32S8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_S32S8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_S32S8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_S32S8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_S32S8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_S32S8S8_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_S32S8S8_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: int8_t ; Input B: uint8_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_S32S8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_S32S8U8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_S32S8U8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_S32S8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_S32S8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_S32S8U8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_S32S8U8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_S32S8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_S32S8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_S32S8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_S32S8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_S32S8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_S32S8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return 
SM90::GMMA::MMA_64x48x32_S32S8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_S32S8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_S32S8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_S32S8U8_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_S32S8U8_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: uint8_t ; Input B: int8_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_S32U8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_S32U8S8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_S32U8S8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_S32U8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_S32U8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_S32U8S8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_S32U8S8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_S32U8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_S32U8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_S32U8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_S32U8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_S32U8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_S32U8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_S32U8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_S32U8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_S32U8S8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_S32U8S8_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_S32U8S8_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: uint8_t ; Input B: uint8_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this 
config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::MMA_64x256x32_S32U8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::MMA_64x240x32_S32U8U8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::MMA_64x224x32_S32U8U8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::MMA_64x208x32_S32U8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::MMA_64x192x32_S32U8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::MMA_64x176x32_S32U8U8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::MMA_64x160x32_S32U8U8_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::MMA_64x144x32_S32U8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::MMA_64x128x32_S32U8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::MMA_64x112x32_S32U8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::MMA_64x96x32_S32U8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::MMA_64x80x32_S32U8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::MMA_64x64x32_S32U8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::MMA_64x48x32_S32U8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::MMA_64x32x32_S32U8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::MMA_64x24x32_S32U8U8_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::MMA_64x16x32_S32U8U8_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::MMA_64x8x32_S32U8U8_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + else { + static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for request configuration."); + } + } + + // Unknown accumulator type + else { + static_assert(sizeof(ElementC) == 0, "Unknown ElementC accumulator type."); + } +} + +template < + class ElementA, + class ElementB, + class ElementC, + class TileShape_MNK, + GMMA::Major MajorA = GMMA::Major::K, + GMMA::Major MajorB = GMMA::Major::K, + auto... Args // e.g. 
GMMA::ScaleOut::One, [GMMA::ScaleIn::One, GMMA::ScaleIn::One] + // But most commonly leave empty for defaults +> +CUTE_HOST_DEVICE constexpr +auto +rs_op_selector_sparse() +{ + static_assert(is_static::value, "TileShape_MNK must be static."); + static_assert(rank(TileShape_MNK{}) == 3, "TileShape_MNK must be rank 3."); + static_assert(size<0>(TileShape_MNK{}) % 64 == 0, "Tile_M must be a multiple of 64."); + static_assert(MajorA == GMMA::Major::K, "Register source A operand GMMAs must have K-major A layout."); + auto Tile_N = size<1>(TileShape_MNK{}); + + // F16 accumulator + if constexpr (is_same_v) { + + // Input A: half_t ; Input B: half_t + if constexpr (is_same_v && is_same_v) { + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x32_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x32_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x32_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x32_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x32_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x120x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x32_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x32_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x32_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x32_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x32_F16F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x32_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x32_F16F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x32_F16F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x32_F16F16F16_RS{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x32_F16F16F16_RS{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e4m3_t ; Input B: float_e4m3_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if 
constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_F16E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_F16E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_F16E4M3E4M3_RS_TN{}; + } 
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F16E4M3E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_F16E4M3E4M3_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E4M3_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e4m3_t ; Input B: float_e5m2_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x128x64_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_F16E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_F16E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F16E4M3E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_F16E4M3E5M2_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E5M2_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e5m2_t ; Input B: float_e4m3_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E4M3_RS_TN{}; + } +#endif 
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x64_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x64_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x64_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E4M3_RS_TN{}; + } +#endif +#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x64_F16E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E4M3_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x64_F16E5M2E4M3_RS_TN{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x64_F16E5M2E4M3_RS_TN{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: float_e5m2_t ; Input B: float_e5m2_t + else if constexpr (is_same_v && is_same_v) { + static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); + static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x64_F16E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E5M2_RS_TN{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x64_F16E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E5M2_RS_TN{}; + } 
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 136 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x136x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+    else if constexpr (Tile_N % 128 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x128x64_F16E5M2E5M2_RS_TN{};
+    }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 120 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x120x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 112 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 104 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+    else if constexpr (Tile_N % 96 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x96x64_F16E5M2E5M2_RS_TN{};
+    }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 88 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 80 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 72 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+    else if constexpr (Tile_N % 64 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x64x64_F16E5M2E5M2_RS_TN{};
+    }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 56 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 48 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 40 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+    else if constexpr (Tile_N % 32 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x32x64_F16E5M2E5M2_RS_TN{};
+    }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+    else if constexpr (Tile_N % 24 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E5M2_RS_TN{};
+    }
+#endif
+    else if constexpr (Tile_N % 16 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x16x64_F16E5M2E5M2_RS_TN{};
+    }
+    else if constexpr (Tile_N % 8 == 0) {
+      return SM90::GMMA::SPARSE::GMMA_64x8x64_F16E5M2E5M2_RS_TN{};
+    }
+    else {
+      static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8.");
+    }
+  }
+
+  else {
+    static_assert(sizeof(ElementA) == 0, "No eligible GMMA operator for requested configuration.");
+  }
+  }
+
+  // F32 accumulator
+  else if constexpr (is_same_v<ElementC, float>) {
+
+    // Input A: half_t ; Input B: half_t
+    if constexpr (is_same_v<ElementA, half_t> && is_same_v<ElementB, half_t>) {
+      static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32.");
+
+      if constexpr (Tile_N % 256 == 0) {
+        return SM90::GMMA::SPARSE::GMMA_64x256x32_F32F16F16_RS{};
+      }
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 248 == 0) {
+        return SM90::GMMA::SPARSE::GMMA_64x248x32_F32F16F16_RS{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 240 == 0) {
+        return SM90::GMMA::SPARSE::GMMA_64x240x32_F32F16F16_RS{};
+      }
+#endif
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+      else if constexpr (Tile_N % 232 == 0) {
+        return SM90::GMMA::SPARSE::GMMA_64x232x32_F32F16F16_RS{};
+      }
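
Every dispatch chain in this selector follows the same shape: walk the candidate instruction N extents from widest to narrowest, take the first that divides `Tile_N`, and fail the build with a `static_assert` when `Tile_N` is not a multiple of 8 (or, one level up, when no operand-type combination matches at all). A minimal self-contained sketch of the technique, with the operator reduced to its N extent (illustrative C++17 only, not CUTLASS source; the function name is invented):

```cpp
// Largest-divisor dispatch, as used by rs_op_selector() above: try instruction
// widths from widest to narrowest and return the first one that tiles Tile_N.
template <int Tile_N>
constexpr int select_instruction_n() {
  static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8.");
  if constexpr      (Tile_N % 256 == 0) { return 256; }
  else if constexpr (Tile_N % 128 == 0) { return 128; }
  else if constexpr (Tile_N %  64 == 0) { return  64; }
  else if constexpr (Tile_N %  32 == 0) { return  32; }
  else if constexpr (Tile_N %  16 == 0) { return  16; }
  else                                  { return   8; }
}

static_assert(select_instruction_n<128>() == 128);
static_assert(select_instruction_n<192>() ==  64);  // 192 divides by 64 but not by 128 or 256
static_assert(select_instruction_n< 40>() ==   8);  // the extended shapes would give 40 its own branch
```

The branches wrapped in `CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED` interleave the finer-grained N extents (every multiple of 8) between the always-available ones, which is why the real chains alternate guarded and unguarded steps.
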
+#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x32_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x32_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x32_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x32_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x32_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x32_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x32_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 64 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x64x32_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 48 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x48x32_F32F16F16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x32_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 32 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x32x32_F32F16F16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x32_F32F16F16_RS{}; + } +#endif + else if constexpr (Tile_N % 16 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x16x32_F32F16F16_RS{}; + } + else if constexpr (Tile_N % 8 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x8x32_F32F16F16_RS{}; + } + else { + static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); + } + } + + // Input A: bfloat16_t ; Input B: bfloat16_t + else if constexpr (is_same_v && is_same_v) { + static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + + if constexpr (Tile_N % 256 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x256x32_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 240 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x240x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 224 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x224x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 208 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x208x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x32_F32BF16BF16_RS{}; + } +#endif + else if constexpr (Tile_N % 192 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x192x32_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 176 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x176x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 160 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x160x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 144 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x144x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x32_F32BF16BF16_RS{}; + } +#endif + else if 
constexpr (Tile_N % 128 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x128x32_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 112 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x112x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x32_F32BF16BF16_RS{}; + } +#endif + else if constexpr (Tile_N % 96 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x96x32_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 80 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x80x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x32_F32BF16BF16_RS{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x64x32_F32BF16BF16_RS{}; } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x32_F32BF16BF16_RS{}; + } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x48x32_F32BF16BF16_RS{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x32_F32BF16BF16_RS{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x32x32_F32BF16BF16_RS{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x32_F32BF16BF16_RS{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x16x32_F32BF16BF16_RS{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x16_F32BF16BF16_RS{}; + return SM90::GMMA::SPARSE::GMMA_64x8x32_F32BF16BF16_RS{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2255,76 +8195,151 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 8 == 0, "Tile_K must be a multiple of 8."); + static_assert(size<2>(TileShape_MNK{}) % 16 == 0, "Tile_K must be a multiple of 16."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x16_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x16_F32TF32TF32_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x16_F32TF32TF32_RS_TN{}; + } +#endif +#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x16_F32TF32TF32_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x16_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x16_F32TF32TF32_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x16_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x16_F32TF32TF32_RS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x16_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x16_F32TF32TF32_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x16_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x16_F32TF32TF32_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x16_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x16_F32TF32TF32_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x16_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x16_F32TF32TF32_RS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x16_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x16_F32TF32TF32_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x16_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x16_F32TF32TF32_RS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x16_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x16_F32TF32TF32_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x8_F32TF32TF32_RS_TN{}; + return 
SM90::GMMA::SPARSE::GMMA_64x80x16_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x16_F32TF32TF32_RS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x16_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x16_F32TF32TF32_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x16_F32TF32TF32_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x16_F32TF32TF32_RS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x16_F32TF32TF32_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x16_F32TF32TF32_RS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x16_F32TF32TF32_RS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x8_F32TF32TF32_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x16_F32TF32TF32_RS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2335,76 +8350,151 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E4M3_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E4M3_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) 
{ - return SM90_64x192x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E4M3_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E4M3_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F32E4M3E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F32E4M3E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F32E4M3E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F32E4M3E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_F32E4M3E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F32E4M3E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 32 == 
0) { - return SM90_64x32x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_F32E4M3E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F32E4M3E4M3_RS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_F32E4M3E4M3_RS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F32E4M3E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_F32E4M3E4M3_RS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2415,76 +8505,151 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E5M2_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E5M2_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E5M2_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr 
(Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E5M2_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F32E4M3E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F32E4M3E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F32E4M3E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F32E4M3E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_F32E4M3E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F32E4M3E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_F32E4M3E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F32E4M3E5M2_RS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_F32E4M3E5M2_RS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F32E4M3E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_F32E4M3E5M2_RS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2495,76 +8660,151 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - 
static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E4M3_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E4M3_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E4M3_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E4M3_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F32E5M2E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return 
SM90_64x112x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E4M3_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E4M3_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E4M3_RS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_F32E5M2E4M3_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E4M3_RS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_F32E5M2E4M3_RS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F32E5M2E4M3_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_F32E5M2E4M3_RS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2575,76 +8815,151 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_F32E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 248 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 232 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E5M2_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) 
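
A pattern worth noting across all of these hunks: every dense op `SM90_64xNxK_*` is replaced by a sparse op `GMMA_64xNx(2K)_*`, and the matching `Tile_K` static_asserts double in step (8 to 16 for TF32, 16 to 32 for F16/BF16, 32 to 64 for FP8 and INT8). With 2:4 structured sparsity, each stored element of A stands in for two logical K positions, so one sparse instruction covers twice the K depth of its dense counterpart. Restated at compile time (illustrative only, not CUTLASS source):

```cpp
// The instruction K depth of a 2:4-sparse GMMA is twice its dense counterpart,
// matching the renames above (e.g. SM90_64x256x32_F32E5M2E4M3_RS_TN becoming
// SM90::GMMA::SPARSE::GMMA_64x256x64_F32E5M2E4M3_RS_TN).
constexpr int sparse_instruction_k(int dense_k) { return 2 * dense_k; }

static_assert(sparse_instruction_k( 8) == 16);  // tf32
static_assert(sparse_instruction_k(16) == 32);  // f16 / bf16
static_assert(sparse_instruction_k(32) == 64);  // fp8 / int8
```
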
else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 216 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E5M2_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 200 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_F32E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 184 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 168 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E5M2_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 152 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E5M2_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 136 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_F32E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 120 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x120x64_F32E5M2E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 104 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_F32E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 88 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 72 == 0) { + return 
SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_F32E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 56 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E5M2_RS_TN{}; } +#endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E5M2_RS_TN{}; + } +#endif +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 40 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E5M2_RS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_F32E5M2E5M2_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E5M2_RS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_F32E5M2E5M2_RS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_F32E5M2E5M2_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_F32E5M2E5M2_RS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2663,76 +8978,81 @@ rs_op_selector() if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8S8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8S8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8S8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8S8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8S8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8S8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8S8_RS_TN{}; } #if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8S8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8S8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8S8_RS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8S8_RS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_S32S8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8S8_RS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2743,76 +9063,81 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return 
SM90_64x128x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_RS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_RS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_S32S8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_RS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2823,76 +9148,81 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8S8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8S8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 144 == 0) { - return SM90_64x144x32_S32U8S8_RS_TN{}; + return 
SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_RS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_RS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_RS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_S32U8S8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_RS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2903,76 +9233,81 @@ rs_op_selector() else if constexpr (is_same_v && is_same_v) { static_assert(MajorA == GMMA::Major::K, "MajorA must be GMMA::Major::K for this config."); static_assert(MajorB == GMMA::Major::K, "MajorB must be GMMA::Major::K for this config."); - static_assert(size<2>(TileShape_MNK{}) % 32 == 0, "Tile_K must be a multiple of 32."); + static_assert(size<2>(TileShape_MNK{}) % 64 == 0, "Tile_K must be a multiple of 64."); if constexpr (Tile_N % 256 == 0) { - return SM90_64x256x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 240 == 0) { - return SM90_64x240x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8U8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 224 == 0) { - return SM90_64x224x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8U8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 208 == 0) { - return SM90_64x208x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 192 == 0) { - return SM90_64x192x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 176 == 0) { - return SM90_64x176x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8U8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 160 == 0) { - return SM90_64x160x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8U8_RS_TN{}; } #endif #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr 
(Tile_N % 144 == 0) { - return SM90_64x144x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 128 == 0) { - return SM90_64x128x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 112 == 0) { - return SM90_64x112x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 96 == 0) { - return SM90_64x96x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 80 == 0) { - return SM90_64x80x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 64 == 0) { - return SM90_64x64x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_RS_TN{}; } #if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) else if constexpr (Tile_N % 48 == 0) { - return SM90_64x48x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8U8_RS_TN{}; } #endif else if constexpr (Tile_N % 32 == 0) { - return SM90_64x32x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_RS_TN{}; + } +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + else if constexpr (Tile_N % 24 == 0) { + return SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8U8_RS_TN{}; } +#endif else if constexpr (Tile_N % 16 == 0) { - return SM90_64x16x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_RS_TN{}; } else if constexpr (Tile_N % 8 == 0) { - return SM90_64x8x32_S32U8U8_RS_TN{}; + return SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_RS_TN{}; } else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); @@ -2990,7 +9325,7 @@ rs_op_selector() } } -} // end namespace GMMA +} // end namespace SM90::GMMA } // end namespace cute //////////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cute/arch/mma_sm90_desc.hpp b/include/cute/arch/mma_sm90_desc.hpp index 1d6caba89d..a53a9748b4 100644 --- a/include/cute/arch/mma_sm90_desc.hpp +++ b/include/cute/arch/mma_sm90_desc.hpp @@ -48,8 +48,7 @@ namespace cute { // GMMA Descriptor and utilities // GMMA enums and utilities -namespace GMMA -{ +namespace SM90::GMMA { enum class LayoutType : uint8_t { INTERLEAVE = 0, @@ -81,7 +80,7 @@ CUTE_HOST std::ostream& operator<<(std::ostream& os, LayoutType const& t) { } #endif // !defined(__CUDACC_RTC__) -} // end namespace GMMA +} // end namespace SM90::GMMA union GmmaDescriptor { @@ -146,7 +145,7 @@ print(GmmaDescriptor const& t) printf(" leading_off: 0x%04x (%d)\n", t.bitfield.leading_byte_offset_, t.bitfield.leading_byte_offset_); printf(" stride_off : 0x%04x (%d)\n", t.bitfield.stride_byte_offset_, t.bitfield.stride_byte_offset_); printf(" base_offset: 0x%01x\n", t.bitfield.base_offset_); - printf(" layout_type: 0x%01x (%s)\n", t.bitfield.layout_type_, to_string(static_cast<GMMA::LayoutType>(t.bitfield.layout_type_))); + printf(" layout_type: 0x%01x (%s)\n", t.bitfield.layout_type_, to_string(static_cast<SM90::GMMA::LayoutType>(t.bitfield.layout_type_))); #endif // !defined(__CUDACC_RTC__) } diff --git a/include/cute/arch/mma_sm90_gmma.hpp b/include/cute/arch/mma_sm90_gmma.hpp index aebb8fab5a..d809aa4a63 100644 --- a/include/cute/arch/mma_sm90_gmma.hpp +++ b/include/cute/arch/mma_sm90_gmma.hpp @@ -30,8 +30,10 @@
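The selector hunks above retarget `rs_op_selector` at the sparse atoms: Tile_K's divisibility requirement doubles from 32 to 64 to match the new `_64xNx64_` shapes, and the `if constexpr` ladder walks Tile_N from the widest extent (256) down to 8, returning the first atom whose N divides the tile, with the non-power-of-two extents (240, 224, ..., down to the newly added 24) gated behind `CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED`. A minimal, self-contained sketch of that dispatch idiom follows; names are illustrative stand-ins, not the library types, and the candidate set is reduced to the power-of-two shapes.

// Hypothetical stand-in for SM90::GMMA::SPARSE::GMMA_64xNx64_* -- not library code.
#include <type_traits>

template <int N> struct SparseAtom {};

template <int Tile_N>
constexpr auto select_sparse_atom() {
  if constexpr      (Tile_N % 256 == 0) { return SparseAtom<256>{}; }
  else if constexpr (Tile_N % 128 == 0) { return SparseAtom<128>{}; }
  else if constexpr (Tile_N %  64 == 0) { return SparseAtom<64>{};  }
  else if constexpr (Tile_N %  32 == 0) { return SparseAtom<32>{};  }
  else if constexpr (Tile_N %  16 == 0) { return SparseAtom<16>{};  }
  else if constexpr (Tile_N %   8 == 0) { return SparseAtom<8>{};   }
  else { static_assert(Tile_N % 8 == 0, "Tile_N must be a multiple of 8."); }
}

// A 192-wide tile is divisible by neither 256 nor 128, so the 64-wide atom wins here.
static_assert(std::is_same_v<decltype(select_sparse_atom<192>()), SparseAtom<64>>);

Because the ladder is ordered widest-first, a tile always gets the largest atom its N extent admits, i.e. the fewest wgmma issues per tile.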
**************************************************************************************************/ #pragma once -#include <cute/config.hpp> -#include <cute/arch/mma.hpp> +#include <cute/config.hpp> // CUTE_HOST_DEVICE + +#include "cutlass/arch/synclog.hpp" + // Config #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && defined(__CUDA_ARCH_FEAT_SM90_ALL)) # define CUTE_ARCH_MMA_SM90A_ENABLED @@ -47,6 +49,7 @@ void warpgroup_arrive() { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_warpgroup_arrive(__LINE__); asm volatile ("wgmma.fence.sync.aligned;\n" ::: "memory"); #else CUTE_INVALID_CONTROL_PATH("Attempting to use wgmma.fence without CUTE_ARCH_MMA_SM90A_ENABLED"); @@ -60,6 +63,7 @@ warpgroup_wait() { static_assert(N >= 0 && N <= 7, "WGMMA wait: N must be in range [0, 7]"); #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_warpgroup_wait(__LINE__, N); asm volatile("wgmma.wait_group.sync.aligned %0;\n" :: "n"(N) : "memory"); #else CUTE_INVALID_CONTROL_PATH("Attempting to use wgmma.wait_group without CUTE_ARCH_MMA_SM90A_ENABLED"); @@ -72,6 +76,7 @@ void warpgroup_commit_batch() { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_warpgroup_commit_batch(__LINE__); asm volatile("wgmma.commit_group.sync.aligned;\n" ::: "memory"); #else CUTE_INVALID_CONTROL_PATH("Attempting to use wgmma.commit_group without CUTE_ARCH_MMA_SM90A_ENABLED"); @@ -97,7 +102,7 @@ warpgroup_fence_operand(float& reg) { #endif } -namespace GMMA { +namespace SM90::GMMA { enum class Major { K = 0, @@ -114,7 +119,11 @@ enum class ScaleIn { One = 1 }; -} // namespace GMMA +enum class SparseSel { + Zero = 0, + One = 1 +}; + //////////////////////////////////////////////////////////////////////////////////////////////////// // GMMA PTX definitions: C = (scaleA * A) * (scaleB * B) + (scaleD * C) @@ -127,7 +136,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x8x16_F16F16F16_SS +struct MMA_64x8x16_F16F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -141,6 +150,7 @@ struct SM90_64x8x16_F16F16F16_SS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -156,7 +166,7 @@ struct SM90_64x8x16_F16F16F16_SS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -170,7 +180,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x8x16_F16F16F16_RS +struct MMA_64x8x16_F16F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -187,6 +197,7 @@ struct SM90_64x8x16_F16F16F16_RS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -202,7 +213,7 @@ struct SM90_64x8x16_F16F16F16_RS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use 
MMA_64x8x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -216,7 +227,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x16x16_F16F16F16_SS +struct MMA_64x16x16_F16F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -230,6 +241,7 @@ struct SM90_64x16x16_F16F16F16_SS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -245,7 +257,7 @@ struct SM90_64x16x16_F16F16F16_SS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -259,7 +271,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x16x16_F16F16F16_RS +struct MMA_64x16x16_F16F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -276,6 +288,7 @@ struct SM90_64x16x16_F16F16F16_RS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -291,7 +304,7 @@ struct SM90_64x16x16_F16F16F16_RS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -305,7 +318,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x32x16_F16F16F16_SS +struct MMA_64x32x16_F16F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -320,6 +333,7 @@ struct SM90_64x32x16_F16F16F16_SS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -336,7 +350,7 @@ struct SM90_64x32x16_F16F16F16_SS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -350,7 +364,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x32x16_F16F16F16_RS +struct MMA_64x32x16_F16F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -368,6 +382,7 @@ struct SM90_64x32x16_F16F16F16_RS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -384,113 +399,10 @@ struct SM90_64x32x16_F16F16F16_RS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90_64x32x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x16 F16+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x16_F16F16F16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[12]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %14, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11}," - " %12," - " %13," - " p, %15, %16, %17, %18;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x16 F16+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x16_F16F16F16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[12]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %17, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11}," - "{%12, %13, %14, %15}," - " %16," - " p, %18, %19, %20;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif 
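Each surviving struct in this file wraps exactly one `wgmma.mma_async` shape. The `_SS` flavor sources both A and B from shared memory through 64-bit matrix descriptors (`ARegisters = uint64_t[1]`), while `_RS` takes A from registers (`uint32_t[4]`, hence the static_assert that A be K-major) and only B by descriptor; in both, the `setp.ne.b32 p, %N, 0` preamble converts the runtime `scale_D` into the predicate that selects between overwriting the accumulator (`ScaleOut::Zero`) and accumulating into it (`ScaleOut::One`). A hedged usage sketch of one renamed atom follows, under the `SM90::GMMA` spelling introduced above; the wrapper name `gmma_64x8x16_f16` is our own, and descriptor construction (`GmmaDescriptor`) is assumed to have happened upstream.

// Hedged sketch, not library code: drive one SS atom end-to-end from device code.
#include <cute/arch/mma_sm90_gmma.hpp>

__device__ void gmma_64x8x16_f16(uint64_t desc_a, uint64_t desc_b,
                                 uint32_t& d0, uint32_t& d1)  // 4 f16 accums/thread
{
  using namespace cute;
  using Atom = SM90::GMMA::MMA_64x8x16_F16F16F16_SS<
      SM90::GMMA::Major::K, SM90::GMMA::Major::K>;   // A and B both K-major

  warpgroup_arrive();                     // wgmma.fence: order prior reg/smem writes
  Atom::fma(desc_a, desc_b, d0, d1,
            SM90::GMMA::ScaleOut::One);   // One: D = A*B + D; Zero: D = A*B
  warpgroup_commit_batch();               // wgmma.commit_group
  warpgroup_wait<0>();                    // drain every committed wgmma group
}

The arrive/commit/wait bracket is the same sequence instrumented by the synclog hooks earlier in this file, which is what makes the asynchronous wgmma traffic visible to the debugging tool.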
//////////////////////////////////////////////////////////////////////////////////////////////////// @@ -501,7 +413,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x64x16_F16F16F16_SS +struct MMA_64x64x16_F16F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -518,6 +430,7 @@ struct SM90_64x64x16_F16F16F16_SS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -537,7 +450,7 @@ struct SM90_64x64x16_F16F16F16_SS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -551,7 +464,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x64x16_F16F16F16_RS +struct MMA_64x64x16_F16F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -571,6 +484,7 @@ struct SM90_64x64x16_F16F16F16_RS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -590,126 +504,13 @@ struct SM90_64x64x16_F16F16F16_RS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x16 F16+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x16_F16F16F16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[20]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %22, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19}," - " %20," - " %21," - " p, %23, %24, %25, %26;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x16 F16+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x16_F16F16F16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[20]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %25, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19}," - "{%20, %21, %22, %23}," - " %24," - " p, %26, %27, %28;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - // GMMA 64x96x16 F16+=F16*F16 template < GMMA::Major tnspA, @@ -717,7 +518,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x96x16_F16F16F16_SS +struct MMA_64x96x16_F16F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -736,6 +537,7 @@ struct SM90_64x96x16_F16F16F16_SS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -758,7 +560,7 @@ struct SM90_64x96x16_F16F16F16_SS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -772,7 +574,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x96x16_F16F16F16_RS +struct MMA_64x96x16_F16F16F16_RS { using DRegisters = void; using ARegisters = 
uint32_t[4]; @@ -794,6 +596,7 @@ struct SM90_64x96x16_F16F16F16_RS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -816,133 +619,10 @@ struct SM90_64x96x16_F16F16F16_RS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x16 F16+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x16_F16F16F16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %30, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - " %28," - " %29," - " p, %31, %32, %33, %34;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x16 F16+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x16_F16F16F16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, 
- uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %33, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - "{%28, %29, %30, %31}," - " %32," - " p, %34, %35, %36;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// @@ -953,7 +633,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x16_F16F16F16_SS +struct MMA_64x128x16_F16F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -974,6 +654,7 @@ struct SM90_64x128x16_F16F16F16_SS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -999,7 +680,7 @@ struct SM90_64x128x16_F16F16F16_SS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -1013,7 +694,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x16_F16F16F16_RS +struct MMA_64x128x16_F16F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -1037,6 +718,7 @@ struct SM90_64x128x16_F16F16F16_RS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -1062,27 +744,26 @@ struct SM90_64x128x16_F16F16F16_RS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x16 F16+=F16*F16 +// GMMA 64x192x16 F16+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x16_F16F16F16_SS +struct MMA_64x192x16_F16F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -1096,22 +777,27 @@ struct SM90_64x144x16_F16F16F16_SS uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %38, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k16.f16.f16.f16 " + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k16.f16.f16.f16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35}," - " %36," - " %37," - " p, %39, %40, %41, %42;\n" + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p, %51, %52, %53, %54;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -1121,33 +807,34 @@ struct SM90_64x144x16_F16F16F16_SS "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x16 F16+=F16*F16 +// GMMA 64x192x16 F16+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x16_F16F16F16_RS +struct MMA_64x192x16_F16F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[48]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -1164,22 +851,27 @@ struct SM90_64x144x16_F16F16F16_RS uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, uint32_t & d32, uint32_t & d33, 
uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %41, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35}," - "{%36, %37, %38, %39}," - " %40," - " p, %42, %43, %44;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p, %54, %55, %56;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -1189,33 +881,34 @@ struct SM90_64x144x16_F16F16F16_RS "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x16 F16+=F16*F16 +// GMMA 64x256x16 F16+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x16_F16F16F16_SS +struct MMA_64x256x16_F16F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -1230,22 +923,32 @@ struct SM90_64x160x16_F16F16F16_SS uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); 
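      // Note on the hook just above: every atom in this file now pairs its PTX with
      // one synclog_emit_* call -- smem-sourced variants record both matrix
      // descriptors, register-sourced variants only desc_b -- tagging each wgmma
      // issue site with its source line. With the synclog tool disabled these hooks
      // are expected to compile out, leaving the asm sequence below unchanged.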
asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p, %43, %44, %45, %46;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p, %67, %68, %69, %70;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -1256,33 +959,37 @@ struct SM90_64x160x16_F16F16F16_SS "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x16 F16+=F16*F16 +// GMMA 64x256x16 F16+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x16_F16F16F16_RS +struct MMA_64x256x16_F16F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; + using CRegisters = uint32_t[64]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -1300,22 +1007,32 @@ struct SM90_64x160x16_F16F16F16_RS uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k16.f16.f16.f16 " - "{%0, %1, 
%2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p, %46, %47, %48;\n" + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p, %70, %71, %72;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -1326,394 +1043,274 @@ struct SM90_64x160x16_F16F16F16_RS "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x16 F16+=F16*F16 +// GMMA 64x8x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x16_F16F16F16_SS +struct MMA_64x8x16_F32F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = float[4]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + float & d0, float & d1, float & d2, float & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %46, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k16.f16.f16.f16 " - 
"{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43}," - " %44," - " %45," - " p, %47, %48, %49, %50;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p, %7, %8, %9, %10;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x16 F16+=F16*F16 +// GMMA 64x8x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x16_F16F16F16_RS +struct MMA_64x8x16_F32F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = float[4]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + float & d0, float & d1, float & d2, float & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %49, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, 
%37, %38, %39, " - " %40, %41, %42, %43}," - "{%44, %45, %46, %47}," - " %48," - " p, %50, %51, %52;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p, %10, %11, %12;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x16 F16+=F16*F16 +// GMMA 64x16x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x16_F16F16F16_SS +struct MMA_64x16x16_F32F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; + using CRegisters = float[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p, %51, %52, %53, %54;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p, %11, %12, %13, %14;\n" "}\n" - : 
"+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x16 F16+=F16*F16 +// GMMA 64x16x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x16_F16F16F16_RS +struct MMA_64x16x16_F32F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; + using CRegisters = float[8]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p, %54, %55, %56;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p, %14, %15, %16;\n" 
"}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x16 F16+=F16*F16 +// GMMA 64x32x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x16_F16F16F16_SS +struct MMA_64x32x16_F32F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[52]; + using CRegisters = float[16]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %54, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - " %52," - " %53," - " p, %55, %56, %57, %58;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, 
%5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p, %19, %20, %21, %22;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x16 F16+=F16*F16 +// GMMA 64x32x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x16_F16F16F16_RS +struct MMA_64x32x16_F32F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[52]; + using CRegisters = float[16]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -1721,154 +1318,114 @@ struct SM90_64x208x16_F16F16F16_RS CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %57, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, 
%7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - "{%52, %53, %54, %55}," - " %56," - " p, %58, %59, %60;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p, %22, %23, %24;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x16 F16+=F16*F16 +// GMMA 64x64x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x16_F16F16F16_SS +struct MMA_64x64x16_F32F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = float[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, 
float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60, %61, %62;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p, %35, %36, %37, %38;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x16 F16+=F16*F16 +// GMMA 64x64x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x16_F16F16F16_RS +struct MMA_64x64x16_F32F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = float[32]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -1876,159 +1433,134 @@ struct SM90_64x224x16_F16F16F16_RS CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, 
uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63, %64;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p, %38, %39, %40;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x16 F16+=F16*F16 +// GMMA 64x96x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x16_F16F16F16_SS +struct MMA_64x96x16_F32F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = float[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %62, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k16.f16.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - " %60," - " %61," - " p, %63, %64, %65, %66;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p, %51, %52, %53, %54;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), 
"+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x16 F16+=F16*F16 +// GMMA 64x96x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x16_F16F16F16_RS +struct MMA_64x96x16_F32F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = float[48]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -2036,109 +1568,102 @@ struct SM90_64x240x16_F16F16F16_RS CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & 
d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %65, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k16.f16.f16.f16 " + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k16.f32.f16.f16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - "{%60, %61, %62, %63}," - " %64," - " p, %66, %67, %68;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p, %54, %55, %56;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x16 F16+=F16*F16 +// GMMA 64x128x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x16_F16F16F16_SS +struct MMA_64x128x16_F32F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = 
uint32_t[64]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k16.f16.f16.f16 " + "wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -2151,46 +1676,46 @@ struct SM90_64x256x16_F16F16F16_SS " %65," " p, %67, %68, %69, %70;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), 
"+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x16 F16+=F16*F16 +// GMMA 64x128x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x16_F16F16F16_RS +struct MMA_64x128x16_F32F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; + using CRegisters = float[64]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -2198,30 +1723,31 @@ struct SM90_64x256x16_F16F16F16_RS CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & 
d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k16.f16.f16.f16 " + "wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -2234,448 +1760,682 @@ struct SM90_64x256x16_F16F16F16_RS " %68," " p, %70, %71, %72;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x8x16 F32+=F16*F16 +// GMMA 64x192x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x8x16_F32F16F16_SS +struct MMA_64x192x16_F32F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[4]; + using CRegisters = float[96]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, 
float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16 " - "{%0, %1, %2, %3}," - " %4," - " %5," - " p, %7, %8, %9, %10;\n" + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p, %99, %100, %101, %102;\n" "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; 
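////////////////////////////////////////////////////////////////////////////////////////////////////

// Editorial sketch (illustrative only, not part of this change): each struct in
// this family wraps exactly one wgmma.mma_async instruction, and the caller owns
// the asynchronous warpgroup protocol around it. Assuming `desc_a` and `desc_b`
// are valid GMMA shared-memory matrix descriptors built elsewhere (descriptor
// construction is out of scope here), a direct call to the 64x32x16
// F32+=F16*F16 smem-by-smem atom could look like the hypothetical helper below;
// cute::gemm() normally issues an equivalent sequence internally. Namespace
// qualification of the atom is elided because it depends on the namespace
// enclosing this header.
CUTE_DEVICE void
example_wgmma_64x32x16(uint64_t desc_a, uint64_t desc_b, float (&d)[16])
{
  cute::warpgroup_arrive();                    // wgmma.fence: make the accumulator
                                               // registers safe to hand to wgmma
  MMA_64x32x16_F32F16F16_SS<GMMA::Major::K, GMMA::Major::K>::fma(
      desc_a, desc_b,
      d[ 0], d[ 1], d[ 2], d[ 3], d[ 4], d[ 5], d[ 6], d[ 7],
      d[ 8], d[ 9], d[10], d[11], d[12], d[13], d[14], d[15],
      GMMA::ScaleOut::One);                    // D = A * B + D
  cute::warpgroup_commit_batch();              // wgmma.commit_group
  cute::warpgroup_wait<0>();                   // block until every committed
                                               // wgmma batch has completed
}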
//////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x8x16 F32+=F16*F16 +// GMMA 64x192x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x8x16_F32F16F16_RS +struct MMA_64x192x16_F32F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[4]; + using CRegisters = float[96]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k16.f32.f16.f16 " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p, %10, %11, %12;\n" + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p, %102, %103, %104;\n" "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), 
"+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x16 F32+=F16*F16 +// GMMA 64x256x16 F32+=F16*F16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x16x16_F32F16F16_SS +struct MMA_64x256x16_F32F16F16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[8]; + using CRegisters = float[128]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - float & d4, float & d5, float & d6, float & d7, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & 
d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - " %8," - " %9," - " p, %11, %12, %13, %14;\n" - "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), - "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x16 F32+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x16x16_F32F16F16_RS + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p, %131, %132, %133, %134;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), 
+ "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x256x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x256x16_F32F16F16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[8]; + using CRegisters = float[128]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - float & d4, float & d5, float & d6, float & d7, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, 
%13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p, %14, %15, %16;\n" + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " p, %134, %135, %136;\n" "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), - "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x16 F32+=F16*F16 +// GMMA 64x8x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x32x16_F32F16F16_SS +struct 
MMA_64x8x16_F32BF16BF16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[16]; + using CRegisters = float[4]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, + float & d0, float & d1, float & d2, float & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - " %16," - " %17," - " p, %19, %20, %21, %22;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k16.f32.bf16.bf16 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p, %7, %8, %9, %10;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x16 F32+=F16*F16 +// GMMA 64x8x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x32x16_F32F16F16_RS +struct MMA_64x8x16_F32BF16BF16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[16]; + using CRegisters = float[4]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, + float & d0, float & d1, float & d2, float & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p, %22, %23, %24;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k16.f32.bf16.bf16 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p, %10, %11, %12;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), 
"+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x16 F32+=F16*F16 +// GMMA 64x16x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x48x16_F32F16F16_SS +struct MMA_64x16x16_F32BF16BF16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[24]; + using CRegisters = float[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p, %27, %28, %29, %30;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p, %11, %12, %13, %14;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x16 F32+=F16*F16 +// GMMA 64x16x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x48x16_F32F16F16_RS +struct MMA_64x16x16_F32BF16BF16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using 
CRegisters = float[24]; + using CRegisters = float[8]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p, %30, %31, %32;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p, %14, %15, %16;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x16 F32+=F16*F16 +// GMMA 64x32x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x64x16_F32F16F16_SS +struct SM90_64x32x16_F32BF16BF16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[32]; + using CRegisters = float[16]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -2684,58 +2444,49 @@ struct SM90_64x64x16_F32F16F16_SS float & d04, float & d05, float & d06, float & d07, float & d08, float & d09, float & d10, float & d11, float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - 
"wgmma.mma_async.sync.aligned.m64n64k16.f32.f16.f16 " + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p, %35, %36, %37, %38;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p, %19, %20, %21, %22;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x16 F32+=F16*F16 +// GMMA 64x32x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x64x16_F32F16F16_RS +struct MMA_64x32x16_F32BF16BF16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[32]; + using CRegisters = float[16]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -2747,59 +2498,49 @@ struct SM90_64x64x16_F32F16F16_RS float & d04, float & d05, float & d06, float & d07, float & d08, float & d09, float & d10, float & d11, float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k16.f32.f16.f16 " + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p, %38, %39, %40;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p, %22, %23, %24;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), 
"n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x16 F32+=F16*F16 +// GMMA 64x64x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x80x16_F32F16F16_SS +struct MMA_64x64x16_F32BF16BF16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[40]; + using CRegisters = float[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -2812,24 +2553,22 @@ struct SM90_64x80x16_F32F16F16_SS float & d20, float & d21, float & d22, float & d23, float & d24, float & d25, float & d26, float & d27, float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k16.f32.f16.f16 " + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p, %43, %44, %45, %46;\n" + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p, %35, %36, %37, %38;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -2838,35 +2577,31 @@ struct SM90_64x80x16_F32F16F16_SS "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x16 F32+=F16*F16 +// GMMA 64x64x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x80x16_F32F16F16_RS +struct MMA_64x64x16_F32BF16BF16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[40]; + using CRegisters = float[32]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -2882,24 +2617,22 @@ struct SM90_64x80x16_F32F16F16_RS 
float & d20, float & d21, float & d22, float & d23, float & d24, float & d25, float & d26, float & d27, float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k16.f32.f16.f16 " + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p, %46, %47, %48;\n" + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p, %38, %39, %40;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -2908,29 +2641,26 @@ struct SM90_64x80x16_F32F16F16_RS "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x16 F32+=F16*F16 +// GMMA 64x96x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x96x16_F32F16F16_SS +struct SM90_64x96x16_F32BF16BF16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -2955,11 +2685,12 @@ struct SM90_64x96x16_F32F16F16_SS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k16.f32.f16.f16 " + "wgmma.mma_async.sync.aligned.m64n96k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -2986,21 +2717,21 @@ struct SM90_64x96x16_F32F16F16_SS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x16 F32+=F16*F16 +// GMMA 64x96x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One > -struct SM90_64x96x16_F32F16F16_RS +struct SM90_64x96x16_F32BF16BF16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -3028,11 +2759,12 @@ struct SM90_64x96x16_F32F16F16_RS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k16.f32.f16.f16 " + "wgmma.mma_async.sync.aligned.m64n96k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -3059,27 +2791,26 @@ struct SM90_64x96x16_F32F16F16_RS "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x16 F32+=F16*F16 +// GMMA 64x128x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x16_F32F16F16_SS +struct SM90_64x128x16_F32BF16BF16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -3098,24 +2829,28 @@ struct SM90_64x112x16_F32F16F16_SS float & d44, float & d45, float & d46, float & d47, float & d48, float & d49, float & d50, float & d51, float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k16.f32.f16.f16 " + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60, %61, %62;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p, %67, %68, %69, %70;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -3130,33 +2865,33 @@ struct SM90_64x112x16_F32F16F16_SS "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x16 F32+=F16*F16 +// GMMA 64x128x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x16_F32F16F16_RS +struct MMA_64x128x16_F32BF16BF16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = float[64]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -3178,24 +2913,28 @@ struct SM90_64x112x16_F32F16F16_RS float & d44, float & d45, float & d46, float & d47, float & d48, float & d49, float & d50, float & d51, float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k16.f32.f16.f16 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63, %64;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p, %70, %71, %72;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -3210,32 +2949,33 @@ struct SM90_64x112x16_F32F16F16_RS "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x16 F32+=F16*F16 +// GMMA 64x192x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x16_F32F16F16_SS +struct MMA_64x192x16_F32BF16BF16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[64]; + using CRegisters = float[96]; CUTE_HOST_DEVICE 
static void fma(uint64_t const& desc_a, @@ -3256,14 +2996,23 @@ struct SM90_64x128x16_F32F16F16_SS float & d52, float & d53, float & d54, float & d55, float & d56, float & d57, float & d58, float & d59, float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 " + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -3271,10 +3020,14 @@ struct SM90_64x128x16_F32F16F16_SS " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p, %67, %68, %69, %70;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p, %99, %100, %101, %102;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -3291,31 +3044,39 @@ struct SM90_64x128x16_F32F16F16_SS "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x16 F32+=F16*F16 +// GMMA 64x192x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x16_F32F16F16_RS +struct SM90_64x192x16_F32BF16BF16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[64]; + using CRegisters = float[96]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); @@ -3339,14 +3100,23 @@ struct 
SM90_64x128x16_F32F16F16_RS float & d52, float & d53, float & d54, float & d55, float & d56, float & d57, float & d58, float & d59, float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16 " + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -3354,10 +3124,14 @@ struct SM90_64x128x16_F32F16F16_RS " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p, %70, %71, %72;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p, %102, %103, %104;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -3374,62 +3148,84 @@ struct SM90_64x128x16_F32F16F16_RS "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x16 F32+=F16*F16 +// GMMA 64x256x16 F32+=BF16*BF16 template < GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x16_F32F16F16_SS +struct SM90_64x256x16_F32BF16BF16_SS { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[72]; + using CRegisters = float[128]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & 
d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %74, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k16.f32.f16.f16 " + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -3438,88 +3234,122 @@ struct SM90_64x144x16_F32F16F16_SS " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - " %72," - " %73," - " p, %75, 
%76, %77, %78;\n" + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p, %131, %132, %133, %134;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x16 F32+=F16*F16 +// GMMA 64x256x16 F32+=BF16*BF16 template < GMMA::Major tnspA, 
GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x16_F32F16F16_RS +struct SM90_64x256x16_F32BF16BF16_RS { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[72]; + using CRegisters = float[128]; static_assert(tnspA == GMMA::Major::K, "Register source operand A must have K major layout."); CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k16.f32.f16.f16 " + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -3528,433 +3358,751 @@ struct SM90_64x144x16_F32F16F16_RS " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p, %78, %79, %80;\n" + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " p, %134, %135, %136;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), 
"+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x16 F32+=F16*F16 +// GMMA 64x8x8 TN F32+=TF32*TF32 template < - GMMA::Major tnspA, - GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x16_F32F16F16_SS +struct MMA_64x8x8_F32TF32TF32_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[80]; + using CRegisters = float[4]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, + float & d0, float & d1, float & d2, float & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %82, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - " %80," - " %81," - " p, %83, %84, %85, %86;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p, %7, %8;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), 
- "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x16 F32+=F16*F16 +// GMMA 64x8x8 TN F32+=TF32*TF32 template < - GMMA::Major tnspA, - GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x16_F32F16F16_RS +struct MMA_64x8x8_F32TF32TF32_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[80]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); + using CRegisters = float[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, + float & d0, float & d1, float & d2, float & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, 
%15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p, %86, %87, %88;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p, %10, %11;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x16 F32+=F16*F16 +// GMMA 64x16x8 TN F32+=TF32*TF32 template < - GMMA::Major tnspA, - GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x16_F32F16F16_SS +struct MMA_64x16x8_F32TF32TF32_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[88]; + using CRegisters = float[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, 
- float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %90, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - " %88," - " %89," - " p, %91, %92, %93, %94;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p, %11, %12;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x16 F32+=F16*F16 +// GMMA 64x16x8 TN F32+=TF32*TF32 template < - GMMA::Major tnspA, - GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x16_F32F16F16_RS +struct SM90_64x16x8_F32TF32TF32_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using 
BRegisters = uint64_t[1]; - using CRegisters = float[88]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); + using CRegisters = float[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p, %94, %95, %96;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p, %14, %15;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), 
"+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x16 F32+=F16*F16 +// GMMA 64x32x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x32x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p, %19, %20;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x32x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x32x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + 
"{%16, %17, %18, %19}," + " %20," + " p, %22, %23;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x64x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x64x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p, %35, %36;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x64x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x64x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, 
+      float & d30, float & d31,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %37, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n64k8.f32.tf32.tf32 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31},"
+      "{%32, %33, %34, %35},"
+      " %36,"
+      " p, %38, %39;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x96x8 TN F32+=TF32*TF32
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x96x8_F32TF32TF32_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[48];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %50, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n96k8.f32.tf32.tf32 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47},"
+      " %48,"
+      " %49,"
+      " p, %51, %52;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47)
+      : "l"(desc_a),
+        "l"(desc_b),
+
"r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x96x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x96x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p, %54, %55;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x128x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x128x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & 
d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p, %67, %68;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x128x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x128x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, 
float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p, %70, %71;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x192x8 TN F32+=TF32*TF32 template < - GMMA::Major tnspA, - GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x16_F32F16F16_SS +struct MMA_64x192x8_F32TF32TF32_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -3991,11 +4139,12 @@ struct SM90_64x192x16_F32F16F16_SS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k16.f32.f16.f16 " + "wgmma.mma_async.sync.aligned.m64n192k8.f32.tf32.tf32 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -4010,7 +4159,7 @@ struct SM90_64x192x16_F32F16F16_SS " %88, %89, %90, %91, %92, %93, %94, %95}," " %96," " %97," - " p, %99, %100, %101, %102;\n" + " p, %99, %100;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -4038,32 +4187,27 @@ struct SM90_64x192x16_F32F16F16_SS "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90_64x192x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x16 F32+=F16*F16 +// GMMA 64x192x8 TN F32+=TF32*TF32 template < - GMMA::Major tnspA, - GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x16_F32F16F16_RS +struct MMA_64x192x8_F32TF32TF32_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; using CRegisters = float[96]; - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, @@ -4094,11 +4238,12 @@ struct SM90_64x192x16_F32F16F16_RS GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k16.f32.f16.f16 " + "wgmma.mma_async.sync.aligned.m64n192k8.f32.tf32.tf32 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -4113,7 +4258,7 @@ struct SM90_64x192x16_F32F16F16_RS " %88, %89, %90, %91, %92, %93, %94, %95}," "{%96, %97, %98, %99}," " %100," - " p, %102, %103, %104;\n" + " p, %102, %103;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -4141,29 +4286,26 @@ struct SM90_64x192x16_F32F16F16_RS "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x16 F32+=F16*F16 +// GMMA 64x256x8 TN F32+=TF32*TF32 template < - GMMA::Major tnspA, - GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x16_F32F16F16_SS +struct MMA_64x256x8_F32TF32TF32_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; + using CRegisters = float[128]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -4194,14 +4336,21 @@ struct SM90_64x208x16_F32F16F16_SS float & d092, float & d093, float & d094, float & d095, float & d096, float & d097, float & d098, float & d099, float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, GMMA::ScaleOut 
const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k16.f32.f16.f16 " + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k8.f32.tf32.tf32 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -4214,10 +4363,13 @@ struct SM90_64x208x16_F32F16F16_SS " %72, %73, %74, %75, %76, %77, %78, %79, " " %80, %81, %82, %83, %84, %85, %86, %87, " " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p, %107, %108, %109, %110;\n" + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p, %131, %132;\n" "}\n" : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), @@ -4244,36 +4396,35 @@ struct SM90_64x208x16_F32F16F16_SS "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x16 F32+=F16*F16 +// GMMA 64x256x8 TN F32+=TF32*TF32 template < - GMMA::Major tnspA, - GMMA::Major tnspB, GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x16_F32F16F16_RS +struct MMA_64x256x8_F32TF32TF32_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); + using CRegisters = float[128]; CUTE_HOST_DEVICE static void fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, @@ -4304,14 +4455,21 @@ struct SM90_64x208x16_F32F16F16_RS float & d092, float & d093, float & d094, float & d095, float & d096, float & d097, float & d098, float & d099, float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & 
d124, float & d125, float & d126, float & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k16.f32.f16.f16 " + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k8.f32.tf32.tf32 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -4324,10 +4482,13 @@ struct SM90_64x208x16_F32F16F16_RS " %72, %73, %74, %75, %76, %77, %78, %79, " " %80, %81, %82, %83, %84, %85, %86, %87, " " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p, %110, %111, %112;\n" + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " p, %134, %135;\n" "}\n" : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), @@ -4354,27428 +4515,667 @@ struct SM90_64x208x16_F32F16F16_RS "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) : "r"(a000), "r"(a001), "r"(a002), "r"(a003), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x16 F32+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x16_F32F16F16_SS +// GMMA 64x8x32 TN S32+=S8*S8 +struct MMA_64x8x32_S32S8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[112]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, 
float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p, %115, %116, %117, %118;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.s8 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), 
- "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x16 F32+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x16_F32F16F16_RS +// GMMA 64x8x32 TN S32+=S8*S8 +struct MMA_64x8x32_S32S8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[112]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, 
%19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p, %118, %119, %120;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x16 F32+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x16_F32F16F16_SS +// GMMA 64x16x32 TN S32+=S8*S8 +struct MMA_64x16x32_S32S8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, 
float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p, %123, %124, %125, %126;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - 
"+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x16 F32+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x16_F32F16F16_RS +// GMMA 64x16x32 TN S32+=S8*S8 +struct MMA_64x16x32_S32S8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - 
float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p, %126, %127, %128;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x16 F32+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x256x16_F32F16F16_SS +// GMMA 64x32x32 TN S32+=S8*S8 +struct MMA_64x32x32_S32S8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[128]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, - float & d120, float & d121, float & d122, float & d123, - float & d124, float & d125, float & d126, float & d127, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " 
- " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p, %131, %132, %133, %134;\n" - "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), - "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), - "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x16 F32+=F16*F16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x256x16_F32F16F16_RS +// GMMA 64x32x32 TN S32+=S8*S8 +struct MMA_64x32x32_S32S8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[128]; - - static_assert(tnspA == GMMA::Major::K, - 
"Register source operand A must have K major layout."); + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, - float & d120, float & d121, float & d122, float & d123, - float & d124, float & d125, float & d126, float & d127, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k16.f32.f16.f16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - "{%128, %129, %130, %131}," - " %132," - " p, %134, %135, %136;\n" + "setp.ne.b32 p, %18, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), - "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), - "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x8x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x16_F32BF16BF16_SS +// GMMA 64x64x32 TN S32+=S8*S8 +struct MMA_64x64x32_S32S8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[4]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + 
uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k16.f32.bf16.bf16 " - "{%0, %1, %2, %3}," - " %4," - " %5," - " p, %7, %8, %9, %10;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p;\n" "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x8x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x16_F32BF16BF16_RS +// GMMA 64x64x32 TN S32+=S8*S8 +struct MMA_64x64x32_S32S8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[4]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k16.f32.bf16.bf16 " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p, %10, %11, %12;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, 
%23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p;\n" "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x16x16_F32BF16BF16_SS +// GMMA 64x96x32 TN S32+=S8*S8 +struct MMA_64x96x32_S32S8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[8]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - float & d4, float & d5, float & d6, float & d7, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - " %8," - " %9," - " p, %11, %12, %13, %14;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p;\n" "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), - "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), 
"+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x16x16_F32BF16BF16_RS +// GMMA 64x96x32 TN S32+=S8*S8 +struct MMA_64x96x32_S32S8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[8]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - float & d4, float & d5, float & d6, float & d7, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p, %14, %15, %16;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p;\n" "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), - "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), 
"+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x16_F32BF16BF16_SS +// GMMA 64x128x32 TN S32+=S8*S8 +struct MMA_64x128x32_S32S8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[16]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - " %16," - " %17," - " p, %19, %20, %21, %22;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," 
+ " %65," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x16_F32BF16BF16_RS +// GMMA 64x128x32 TN S32+=S8*S8 +struct MMA_64x128x32_S32S8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[16]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p, %22, %23, %24;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -};
+ "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +};
- -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[24]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p, %27, %28, %29, %30;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x16_F32BF16BF16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[24]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02,
float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p, %30, %31, %32;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[32]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p, %35, %36, %37, %38;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - 
GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x16_F32BF16BF16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[32]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p, %38, %39, %40;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[40]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, 
%18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p, %43, %44, %45, %46;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x16_F32BF16BF16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[40]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p, %46, %47, %48;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x96x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[48]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p, %51, %52, %53, %54;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x96x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x16_F32BF16BF16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[48]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - 
float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p, %54, %55, %56;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[56]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, 
%34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60, %61, %62;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x16_F32BF16BF16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[56]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63, %64;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), 
"+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[64]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p, %67, %68, %69, %70;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), 
"n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x16_F32BF16BF16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[64]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p, %70, %71, %72;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x144x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[72]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %74, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - " %72," - " %73," - " p, %75, %76, %77, %78;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = 
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x144x16_F32BF16BF16_RS
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[72];
-
-  static_assert(tnspA == GMMA::Major::K,
-      "Register source operand A must have K major layout.");
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %77, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      "{%72, %73, %74, %75},"
-      " %76,"
-      " p, %78, %79, %80;\n"
-    "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x16 F32+=BF16*BF16
-template <
-  GMMA::Major tnspA,
-  GMMA::Major tnspB,
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x160x16_F32BF16BF16_SS
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[80];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %82, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      " %80,"
-      " %81,"
-      " p, %83, %84, %85, %86;\n"
-    "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x16 F32+=BF16*BF16
-template <
-  GMMA::Major tnspA,
-  GMMA::Major tnspB,
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x160x16_F32BF16BF16_RS
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[80];
-
-  static_assert(tnspA == GMMA::Major::K,
-      "Register source operand A must have K major layout.");
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %85, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      "{%80, %81, %82, %83},"
-      " %84,"
-      " p, %86, %87, %88;\n"
-    "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x176x16 F32+=BF16*BF16
-template <
-  GMMA::Major tnspA,
-  GMMA::Major tnspB,
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x176x16_F32BF16BF16_SS
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[88];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float & d79,
-      float & d80, float & d81, float & d82, float & d83,
-      float & d84, float & d85, float & d86, float & d87,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %90, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n176k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87},"
-      " %88,"
-      " %89,"
-      " p, %91, %92, %93, %94;\n"
-    "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
"+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x176x16_F32BF16BF16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[88]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p, %94, %95, %96;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), 
"+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x192x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[96]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - float & d88, float & d89, float & d90, float & d91, - float & d92, float & d93, float & d94, float & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " 
-    "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
-        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
-        "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87),
-        "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91),
-        "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x192x16 F32+=BF16*BF16
-template <
-  GMMA::Major tnspA,
-  GMMA::Major tnspB,
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x192x16_F32BF16BF16_RS
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[96];
-
-  static_assert(tnspA == GMMA::Major::K,
-      "Register source operand A must have K major layout.");
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float & d79,
-      float & d80, float & d81, float & d82, float & d83,
-      float & d84, float & d85, float & d86, float & d87,
-      float & d88, float & d89, float & d90, float & d91,
-      float & d92, float & d93, float & d94, float & d95,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p, %102, %103, %104;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), - "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), - "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[104]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, 
-      float & d060, float & d061, float & d062, float & d063,
-      float & d064, float & d065, float & d066, float & d067,
-      float & d068, float & d069, float & d070, float & d071,
-      float & d072, float & d073, float & d074, float & d075,
-      float & d076, float & d077, float & d078, float & d079,
-      float & d080, float & d081, float & d082, float & d083,
-      float & d084, float & d085, float & d086, float & d087,
-      float & d088, float & d089, float & d090, float & d091,
-      float & d092, float & d093, float & d094, float & d095,
-      float & d096, float & d097, float & d098, float & d099,
-      float & d100, float & d101, float & d102, float & d103,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %106, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n208k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103},"
-      " %104,"
-      " %105,"
-      " p, %107, %108, %109, %110;\n"
-    "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
-        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
-        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
-        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
-        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
-        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
-        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
-        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
-        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
-        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
-        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
-        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
-        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
-        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
-        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
-        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
-        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
-        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
-        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
-        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
-        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
-        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
-        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x208x16 F32+=BF16*BF16
-template <
-  GMMA::Major tnspA,
-  GMMA::Major tnspB,
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x208x16_F32BF16BF16_RS
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[104];
-
-  static_assert(tnspA == GMMA::Major::K,
-      "Register source operand A must have K major layout.");
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
-      uint64_t const& desc_b,
-      float & d000, float & d001, float & d002, float & d003,
-      float & d004, float & d005, float & d006, float & d007,
-      float & d008, float & d009, float & d010, float & d011,
-      float & d012, float & d013, float & d014, float & d015,
-      float & d016, float & d017, float & d018, float & d019,
-      float & d020, float & d021, float & d022, float & d023,
-      float & d024, float & d025, float & d026, float & d027,
-      float & d028, float & d029, float & d030, float & d031,
-      float & d032, float & d033, float & d034, float & d035,
-      float & d036, float & d037, float & d038, float & d039,
-      float & d040, float & d041, float & d042, float & d043,
-      float & d044, float & d045, float & d046, float & d047,
-      float & d048, float & d049, float & d050, float & d051,
-      float & d052, float & d053, float & d054, float & d055,
-      float & d056, float & d057, float & d058, float & d059,
-      float & d060, float & d061, float & d062, float & d063,
-      float & d064, float & d065, float & d066, float & d067,
-      float & d068, float & d069, float & d070, float & d071,
-      float & d072, float & d073, float & d074, float & d075,
-      float & d076, float & d077, float & d078, float & d079,
-      float & d080, float & d081, float & d082, float & d083,
-      float & d084, float & d085, float & d086, float & d087,
-      float & d088, float & d089, float & d090, float & d091,
-      float & d092, float & d093, float & d094, float & d095,
-      float & d096, float & d097, float & d098, float & d099,
-      float & d100, float & d101, float & d102, float & d103,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %109, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n208k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103},"
-      "{%104, %105, %106, %107},"
-      " %108,"
-      " p, %110, %111, %112;\n"
-    "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
-        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
-        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
-        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
-        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
-        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
-        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
-        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
-        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
-        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
-        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
-        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
-        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
"+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x16_F32BF16BF16_SS -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[112]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, 
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111},"
-      " %112,"
-      " %113,"
-      " p, %115, %116, %117, %118;\n"
-    "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
-        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
-        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
-        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
-        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
-        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
-        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
-        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
-        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
-        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
-        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
-        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
-        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
-        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
-        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
-        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
-        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
-        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
-        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
-        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
-        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
-        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
-        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
-        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107),
-        "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x224x16 F32+=BF16*BF16
-template <
-  GMMA::Major tnspA,
-  GMMA::Major tnspB,
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x224x16_F32BF16BF16_RS
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[112];
-
-  static_assert(tnspA == GMMA::Major::K,
-      "Register source operand A must have K major layout.");
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
-      uint64_t const& desc_b,
-      float & d000, float & d001, float & d002, float & d003,
-      float & d004, float & d005, float & d006, float & d007,
-      float & d008, float & d009, float & d010, float & d011,
-      float & d012, float & d013, float & d014, float & d015,
-      float & d016, float & d017, float & d018, float & d019,
-      float & d020, float & d021, float & d022, float & d023,
-      float & d024, float & d025, float & d026, float & d027,
-      float & d028, float & d029, float & d030, float & d031,
-      float & d032, float & d033, float & d034, float & d035,
-      float & d036, float & d037, float & d038, float & d039,
-      float & d040, float & d041, float & d042, float & d043,
-      float & d044, float & d045, float & d046, float & d047,
-      float & d048, float & d049, float & d050, float & d051,
-      float & d052, float & d053, float & d054, float & d055,
-      float & d056, float & d057, float & d058, float & d059,
-      float & d060, float & d061, float & d062, float & d063,
-      float & d064, float & d065, float & d066, float & d067,
-      float & d068, float & d069, float & d070, float & d071,
-      float & d072, float & d073, float & d074, float & d075,
-      float & d076, float & d077, float & d078, float & d079,
-      float & d080, float & d081, float & d082, float & d083,
-      float & d084, float & d085, float & d086, float & d087,
-      float & d088, float & d089, float & d090, float & d091,
-      float & d092, float & d093, float & d094, float & d095,
-      float & d096, float & d097, float & d098, float & d099,
-      float & d100, float & d101, float & d102, float & d103,
-      float & d104, float & d105, float & d106, float & d107,
-      float & d108, float & d109, float & d110, float & d111,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %117, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n224k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111},"
-      "{%112, %113, %114, %115},"
-      " %116,"
-      " p, %118, %119, %120;\n"
-    "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
-        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
-        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
-        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
-        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
-        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
-        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
-        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
-        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
-        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
-        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
-        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
-        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
-        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
-        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
-        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
-        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
-        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
-        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
-        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
-        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
-        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
-        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
-        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107),
-        "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x240x16 F32+=BF16*BF16
-template <
-  GMMA::Major tnspA,
-  GMMA::Major tnspB,
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x240x16_F32BF16BF16_SS
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[120];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d000, float & d001, float & d002, float & d003,
-      float & d004, float & d005, float & d006, float & d007,
-      float & d008, float & d009, float & d010, float & d011,
-      float & d012, float & d013, float & d014, float & d015,
-      float & d016, float & d017, float & d018, float & d019,
-      float & d020, float & d021, float & d022, float & d023,
-      float & d024, float & d025, float & d026, float & d027,
-      float & d028, float & d029, float & d030, float & d031,
-      float & d032, float & d033, float & d034, float & d035,
-      float & d036, float & d037, float & d038, float & d039,
-      float & d040, float & d041, float & d042, float & d043,
-      float & d044, float & d045, float & d046, float & d047,
-      float & d048, float & d049, float & d050, float & d051,
-      float & d052, float & d053, float & d054, float & d055,
-      float & d056, float & d057, float & d058, float & d059,
-      float & d060, float & d061, float & d062, float & d063,
-      float & d064, float & d065, float & d066, float & d067,
-      float & d068, float & d069, float & d070, float & d071,
-      float & d072, float & d073, float & d074, float & d075,
-      float & d076, float & d077, float & d078, float & d079,
-      float & d080, float & d081, float & d082, float & d083,
-      float & d084, float & d085, float & d086, float & d087,
-      float & d088, float & d089, float & d090, float & d091,
-      float & d092, float & d093, float & d094, float & d095,
-      float & d096, float & d097, float & d098, float & d099,
-      float & d100, float & d101, float & d102, float & d103,
-      float & d104, float & d105, float & d106, float & d107,
-      float & d108, float & d109, float & d110, float & d111,
-      float & d112, float & d113, float & d114, float & d115,
-      float & d116, float & d117, float & d118, float & d119,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %122, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n240k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119},"
-      " %120,"
-      " %121,"
-      " p, %123, %124, %125, %126;\n"
-    "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
"+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x16_F32BF16BF16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[120]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float 
-      float & d080, float & d081, float & d082, float & d083,
-      float & d084, float & d085, float & d086, float & d087,
-      float & d088, float & d089, float & d090, float & d091,
-      float & d092, float & d093, float & d094, float & d095,
-      float & d096, float & d097, float & d098, float & d099,
-      float & d100, float & d101, float & d102, float & d103,
-      float & d104, float & d105, float & d106, float & d107,
-      float & d108, float & d109, float & d110, float & d111,
-      float & d112, float & d113, float & d114, float & d115,
-      float & d116, float & d117, float & d118, float & d119,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %125, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n240k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119},"
-      "{%120, %121, %122, %123},"
-      " %124,"
-      " p, %126, %127, %128;\n"
-    "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
-        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
-        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
-        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
-        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
-        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
-        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
-        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
-        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
-        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
-        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
-        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
-        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
-        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
-        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
-        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
-        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
-        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
-        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
-        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
-        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
-        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
-        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
-        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107),
-        "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111),
-        "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115),
-        "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x256x16 F32+=BF16*BF16
-template <
-  GMMA::Major tnspA,
-  GMMA::Major tnspB,
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x256x16_F32BF16BF16_SS
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[128];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d000, float & d001, float & d002, float & d003,
-      float & d004, float & d005, float & d006, float & d007,
-      float & d008, float & d009, float & d010, float & d011,
-      float & d012, float & d013, float & d014, float & d015,
-      float & d016, float & d017, float & d018, float & d019,
-      float & d020, float & d021, float & d022, float & d023,
-      float & d024, float & d025, float & d026, float & d027,
-      float & d028, float & d029, float & d030, float & d031,
-      float & d032, float & d033, float & d034, float & d035,
-      float & d036, float & d037, float & d038, float & d039,
-      float & d040, float & d041, float & d042, float & d043,
-      float & d044, float & d045, float & d046, float & d047,
-      float & d048, float & d049, float & d050, float & d051,
-      float & d052, float & d053, float & d054, float & d055,
-      float & d056, float & d057, float & d058, float & d059,
-      float & d060, float & d061, float & d062, float & d063,
-      float & d064, float & d065, float & d066, float & d067,
-      float & d068, float & d069, float & d070, float & d071,
-      float & d072, float & d073, float & d074, float & d075,
-      float & d076, float & d077, float & d078, float & d079,
-      float & d080, float & d081, float & d082, float & d083,
-      float & d084, float & d085, float & d086, float & d087,
-      float & d088, float & d089, float & d090, float & d091,
-      float & d092, float & d093, float & d094, float & d095,
-      float & d096, float & d097, float & d098, float & d099,
-      float & d100, float & d101, float & d102, float & d103,
-      float & d104, float & d105, float & d106, float & d107,
-      float & d108, float & d109, float & d110, float & d111,
-      float & d112, float & d113, float & d114, float & d115,
-      float & d116, float & d117, float & d118, float & d119,
-      float & d120, float & d121, float & d122, float & d123,
-      float & d124, float & d125, float & d126, float & d127,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %130, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119, "
-      " %120, %121, %122, %123, %124, %125, %126, %127},"
-      " %128,"
-      " %129,"
-      " p, %131, %132, %133, %134;\n"
-    "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
-        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
-        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
"+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), - "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), - "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x16 F32+=BF16*BF16 -template < - GMMA::Major tnspA, - GMMA::Major tnspB, - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x256x16_F32BF16BF16_RS -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[128]; - - static_assert(tnspA == GMMA::Major::K, - "Register source operand A must have K major layout."); - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & 
d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, - float & d120, float & d121, float & d122, float & d123, - float & d124, float & d125, float & d126, float & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - "{%128, %129, %130, %131}," - " %132," - " p, %134, %135, %136;\n" - "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), - "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), - "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x16_F32BF16BF16_RS without 
CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x8_F32TF32TF32_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[4]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32 " - "{%0, %1, %2, %3}," - " %4," - " %5," - " p, %7, %8;\n" - "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x8_F32TF32TF32_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k8.f32.tf32.tf32 " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p, %10, %11;\n" - "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x16x8_F32TF32TF32_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[8]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - float & d4, float & d5, float & d6, float & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - " %8," - " %9," - " p, %11, %12;\n" - "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), - "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use 
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x16x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[8];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
-      uint64_t const& desc_b,
-      float & d0, float & d1, float & d2, float & d3,
-      float & d4, float & d5, float & d6, float & d7,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %13, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7},"
-      "{%8, %9, %10, %11},"
-      " %12,"
-      " p, %14, %15;\n"
-    "}\n"
-    : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3),
-      "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7)
-    : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x32x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x32x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[16];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %18, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15},"
-      " %16,"
-      " %17,"
-      " p, %19, %20;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x32x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x32x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[16];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %21, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15},"
-      "{%16, %17, %18, %19},"
-      " %20,"
-      " p, %22, %23;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15)
-    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x48x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %26, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      " %24,"
-      " %25,"
-      " p, %27, %28;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x48x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %29, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      "{%24, %25, %26, %27},"
-      " %28,"
-      " p, %30, %31;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23)
-    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x64x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x64x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[32];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %34, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31},"
-      " %32,"
-      " %33,"
-      " p, %35, %36;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x64x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x64x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[32];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %37, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31},"
-      "{%32, %33, %34, %35},"
-      " %36,"
-      " p, %38, %39;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31)
-    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x80x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %42, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      " %40,"
-      " %41,"
-      " p, %43, %44;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x80x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %45, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      "{%40, %41, %42, %43},"
-      " %44,"
-      " p, %46, %47;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
-    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x96x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %50, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      " %48,"
-      " %49,"
-      " p, %51, %52;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x96x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %53, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      "{%48, %49, %50, %51},"
-      " %52,"
-      " p, %54, %55;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47)
-    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x112x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x112x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[56];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %58, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n112k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55},"
-      " %56,"
-      " %57,"
-      " p, %59, %60;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x112x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x112x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[56];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %61, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n112k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55},"
-      "{%56, %57, %58, %59},"
-      " %60,"
-      " p, %62, %63;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55)
-    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x128x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x128x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[64];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %66, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n128k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      " %64,"
-      " %65,"
-      " p, %67, %68;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-      "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-      "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x128x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x128x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[64];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %69, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n128k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      "{%64, %65, %66, %67},"
-      " %68,"
-      " p, %70, %71;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-      "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-      "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63)
-    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x144x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[72];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %74, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      " %72,"
-      " %73,"
-      " p, %75, %76;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-      "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-      "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-      "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-      "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x144x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[72];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %77, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      "{%72, %73, %74, %75},"
-      " %76,"
-      " p, %78, %79;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-      "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-      "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-      "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-      "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71)
-    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x160x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[80];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %82, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      " %80,"
-      " %81,"
-      " p, %83, %84;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-      "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-      "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-      "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-      "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-      "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-      "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x160x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[80];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %85, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      "{%80, %81, %82, %83},"
-      " %84,"
-      " p, %86, %87;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-      "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-      "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-      "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-      "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-      "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-      "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79)
-    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x176x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x176x8_F32TF32TF32_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[88];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float & d79,
-      float & d80, float & d81, float & d82, float & d83,
-      float & d84, float & d85, float & d86, float & d87,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %90, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n176k8.f32.tf32.tf32 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87},"
-      " %88,"
-      " %89,"
-      " p, %91, %92;\n"
-    "}\n"
-    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-      "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-      "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-      "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-      "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-      "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-      "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
-      "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
-      "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x176x8 TN F32+=TF32*TF32
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x176x8_F32TF32TF32_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[88];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float 
& d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p, %94, %95;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x192x8_F32TF32TF32_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[96]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, 
float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - float & d88, float & d89, float & d90, float & d91, - float & d92, float & d93, float & d94, float & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " p, %99, %100;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), - "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), - "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x192x8_F32TF32TF32_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[96]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & 
d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - float & d88, float & d89, float & d90, float & d91, - float & d92, float & d93, float & d94, float & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p, %102, %103;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), - "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), - "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x8_F32TF32TF32_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using 
BRegisters = uint64_t[1]; - using CRegisters = float[104]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p, %107, %108;\n" - "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), 
"+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x8_F32TF32TF32_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[104]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p, %110, %111;\n" - 
"}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x8_F32TF32TF32_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[112]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, 
float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p, %115, %116;\n" - "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x8_F32TF32TF32_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[112]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, 
float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p, %118, %119;\n" - "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - 
"+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x8_F32TF32TF32_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[120]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, 
%79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p, %123, %124;\n" - "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x8_F32TF32TF32_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[120]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - 
float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p, %126, %127;\n" - "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to 
use SM90_64x240x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x256x8_F32TF32TF32_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[128]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, - float & d120, float & d121, float & d122, float & d123, - float & d124, float & d125, float & d126, float & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p, %131, %132;\n" - "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - 
"+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), - "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), - "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x8 TN F32+=TF32*TF32 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x256x8_F32TF32TF32_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[128]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, 
float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, - float & d120, float & d121, float & d122, float & d123, - float & d124, float & d125, float & d126, float & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k8.f32.tf32.tf32 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - "{%128, %129, %130, %131}," - " %132," - " p, %134, %135;\n" - "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), - "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), - "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x8_F32TF32TF32_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=S8*S8 -struct SM90_64x8x32_S32S8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.s8 " - "{%0, %1, %2, %3}," - " %4," - " %5," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=S8*S8 -struct SM90_64x8x32_S32S8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3}," - " %4," - " %5," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=S8*S8 -struct SM90_64x16x32_S32S8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - " %8," - " %9," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=S8*S8 -struct SM90_64x16x32_S32S8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t 
& d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - " %8," - " %9," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=S8*S8 -struct SM90_64x32x32_S32S8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - " %16," - " %17," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=S8*S8 -struct SM90_64x32x32_S32S8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - " %16," - " %17," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - 
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN S32+=S8*S8
-struct SM90_64x48x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %26, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      " %24,"
-      " %25,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN S32+=S8*S8
-struct SM90_64x48x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %26, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      " %24,"
-      " %25,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x64x32 TN S32+=S8*S8
-struct SM90_64x64x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[32];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %34, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31},"
-      " %32,"
-      " %33,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x64x32 TN S32+=S8*S8
-struct SM90_64x64x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[32];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %34, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31},"
-      " %32,"
-      " %33,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN S32+=S8*S8
-struct SM90_64x80x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %42, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      " %40,"
-      " %41,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN S32+=S8*S8
-struct SM90_64x80x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %42, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      " %40,"
-      " %41,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=S8*S8
-struct SM90_64x96x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %50, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      " %48,"
-      " %49,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=S8*S8
-struct SM90_64x96x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %50, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      " %48,"
-      " %49,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x112x32 TN S32+=S8*S8
-struct SM90_64x112x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[56];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %58, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55},"
-      " %56,"
-      " %57,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x112x32 TN S32+=S8*S8
-struct SM90_64x112x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[56];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %58, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55},"
-      " %56,"
-      " %57,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x128x32 TN S32+=S8*S8
-struct SM90_64x128x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[64];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %66, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      " %64,"
-      " %65,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x128x32 TN S32+=S8*S8
-struct SM90_64x128x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[64];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %66, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      " %64,"
-      " %65,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x32 TN S32+=S8*S8
-struct SM90_64x144x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[72];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %74, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      " %72,"
-      " %73,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x32 TN S32+=S8*S8
-struct SM90_64x144x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[72];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %74, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      " %72,"
-      " %73,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x32 TN S32+=S8*S8
-struct SM90_64x160x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[80];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %82, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      " %80,"
-      " %81,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x32 TN S32+=S8*S8
-struct SM90_64x160x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[80];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %82, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      " %80,"
-      " %81,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
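// ---------------------------------------------------------------------------
// Note on the structs removed in this hunk: each is a thin wrapper over one
// `wgmma.mma_async` PTX shape, and the guarded `setp.ne.b32 p, %N, 0` turns
// the `scale_D` argument into the predicate selecting D = A*B (ScaleOut::Zero)
// versus D += A*B (ScaleOut::One). A minimal usage sketch follows; it is not
// part of the diff. It assumes `desc_a`/`desc_b` are valid GMMA shared-memory
// matrix descriptors built elsewhere, compilation for sm_90a, and an
// illustrative function name that is not CUTLASS API.
#include <cute/arch/mma_sm90_gmma.hpp>

__device__ void accumulate_s8_64x8x32(uint64_t desc_a, uint64_t desc_b,
                                      uint32_t (&acc)[4], bool first_k_tile)
{
  // Fence preceding register/SMEM writes, then issue the async warpgroup MMA.
  cute::warpgroup_arrive();
  cute::SM90_64x8x32_S32S8S8_SS_TN::fma(
      desc_a, desc_b,
      acc[0], acc[1], acc[2], acc[3],
      first_k_tile ? cute::GMMA::ScaleOut::Zero   // overwrite accumulators
                   : cute::GMMA::ScaleOut::One);  // accumulate into them
  // Commit the wgmma batch and wait for it to retire before reading `acc`.
  cute::warpgroup_commit_batch();
  cute::warpgroup_wait<0>();
}
// ---------------------------------------------------------------------------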
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x176x32 TN S32+=S8*S8
-struct SM90_64x176x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[88];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
-      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %90, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87},"
-      " %88,"
-      " %89,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
-        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
-        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x176x32 TN S32+=S8*S8
-struct SM90_64x176x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[88];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
-      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %90, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87},"
-      " %88,"
-      " %89,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
-        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
-        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x192x32 TN S32+=S8*S8
-struct SM90_64x192x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[96];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
-      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
-      uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91,
-      uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %98, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95},"
-      " %96,"
-      " %97,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
-        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
-        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
-        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
-        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x192x32 TN S32+=S8*S8
-struct SM90_64x192x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[96];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
-      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
-      uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91,
-      uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %98, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95},"
-      " %96,"
-      " %97,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
-        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
-        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
-        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
-        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x208x32 TN S32+=S8*S8
-struct SM90_64x208x32_S32S8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[104];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %106, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103},"
-      " %104,"
-      " %105,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x208x32 TN S32+=S8*S8
-struct SM90_64x208x32_S32S8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[104];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %106, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103},"
-      " %104,"
-      " %105,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084),
"+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=S8*S8 -struct SM90_64x224x32_S32S8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, 
%82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=S8*S8 -struct SM90_64x224x32_S32S8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - 
uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=S8*S8 -struct SM90_64x240x32_S32S8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = 
uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), 
"+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=S8*S8 -struct SM90_64x240x32_S32S8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & 
d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=S8*S8 -struct SM90_64x256x32_S32S8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, 
uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), 
"+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=S8*S8 -struct SM90_64x256x32_S32S8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & 
d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=S8*S8 -struct SM90_64x8x32_S32S8S8_RS_TN -{ - using DRegisters = void; 
- using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.s8 " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=S8*S8 -struct SM90_64x8x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=S8*S8 -struct SM90_64x16x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=S8*S8 -struct SM90_64x16x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, 
uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=S8*S8 -struct SM90_64x32x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=S8*S8 -struct SM90_64x32x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), 
"+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN S32+=S8*S8 -struct SM90_64x48x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN S32+=S8*S8 -struct SM90_64x48x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), 
"+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=S8*S8 -struct SM90_64x64x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=S8*S8 -struct SM90_64x64x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - 
"{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=S8*S8 -struct SM90_64x80x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=S8*S8 -struct SM90_64x80x32_S32S8S8_RS_TN_SATURATE -{ - using 
DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %45, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      "{%40, %41, %42, %43},"
-      " %44,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=S8*S8
-struct SM90_64x96x32_S32S8S8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %53, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      "{%48, %49, %50, %51},"
-      " %52,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=S8*S8
-struct SM90_64x96x32_S32S8S8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %53, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      "{%48, %49, %50, %51},"
-      " %52,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x112x32 TN S32+=S8*S8
-struct SM90_64x112x32_S32S8S8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[56];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %61, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55},"
-      "{%56, %57, %58, %59},"
-      " %60,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x112x32 TN S32+=S8*S8
-struct SM90_64x112x32_S32S8S8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[56];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t 
const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=S8*S8 -struct SM90_64x128x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - 
uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=S8*S8 -struct SM90_64x128x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, 
uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN S32+=S8*S8 -struct SM90_64x144x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[72]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - 
"wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN S32+=S8*S8 -struct SM90_64x144x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[72]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " 
- " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=S8*S8 -struct SM90_64x160x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.s8 " - "{%0, 
%1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=S8*S8 -struct SM90_64x160x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=S8*S8 -struct SM90_64x176x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & 
d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=S8*S8 -struct SM90_64x176x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, 
uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=S8*S8 -struct SM90_64x192x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, 
uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=S8*S8 -struct 
SM90_64x192x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), 
"+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=S8*S8 -struct SM90_64x208x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, 
%81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=S8*S8 -struct SM90_64x208x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, 
uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=S8*S8 -struct SM90_64x224x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t 
& d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), 
"+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=S8*S8 -struct SM90_64x224x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" 
- "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=S8*S8 -struct SM90_64x240x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & 
d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), 
"+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=S8*S8 -struct SM90_64x240x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, 
%9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=S8*S8 -struct SM90_64x256x32_S32S8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - 
uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - "{%128, %129, %130, %131}," - " %132," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - 
"+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=S8*S8 -struct SM90_64x256x32_S32S8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, 
uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - "{%128, %129, %130, %131}," - " %132," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=S8*U8 -struct SM90_64x8x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" 
- ".reg .pred p;\n" - "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.u8 " - "{%0, %1, %2, %3}," - " %4," - " %5," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=S8*U8 -struct SM90_64x8x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3}," - " %4," - " %5," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=S8*U8 -struct SM90_64x16x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - " %8," - " %9," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=S8*U8 -struct SM90_64x16x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - " %8," - " %9," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90_64x16x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=S8*U8 -struct SM90_64x32x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - " %16," - " %17," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=S8*U8 -struct SM90_64x32x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - " %16," - " %17," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN S32+=S8*U8 -struct SM90_64x48x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & 
d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN S32+=S8*U8 -struct SM90_64x48x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=S8*U8 -struct SM90_64x64x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, 
uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=S8*U8 -struct SM90_64x64x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=S8*U8 -struct SM90_64x80x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& 
desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=S8*U8 -struct SM90_64x80x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - 
"+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x96x32 TN S32+=S8*U8 -struct SM90_64x96x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x96x32 TN S32+=S8*U8 -struct SM90_64x96x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, 
uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN S32+=S8*U8 -struct SM90_64x112x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - 
"wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN S32+=S8*U8 -struct SM90_64x112x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - 
"+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=S8*U8 -struct SM90_64x128x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - 
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x128x32 TN S32+=S8*U8
-struct SM90_64x128x32_S32S8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[64];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %66, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      " %64,"
-      " %65,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x32 TN S32+=S8*U8
-struct SM90_64x144x32_S32S8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[72];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %74, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      " %72,"
-      " %73,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x32 TN S32+=S8*U8
-struct SM90_64x144x32_S32S8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[72];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %74, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      " %72,"
-      " %73,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x32 TN S32+=S8*U8
-struct SM90_64x160x32_S32S8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[80];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %82, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      " %80,"
-      " %81,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x32 TN S32+=S8*U8
-struct SM90_64x160x32_S32S8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[80];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %82, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      " %80,"
-      " %81,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x176x32 TN S32+=S8*U8
-struct SM90_64x176x32_S32S8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[88];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
-      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %90, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87},"
-      " %88,"
-      " %89,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
-        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
-        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x176x32 TN S32+=S8*U8
-struct SM90_64x176x32_S32S8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[88];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
-      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %90, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87},"
-      " %88,"
-      " %89,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
-        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
-        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
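> Editor's note (not part of the diff): every struct removed in this hunk wraps one `wgmma.mma_async` PTX instruction behind the same `fma` signature, which is what lets CuTe dispatch on tile shape at compile time. As an illustrative sketch only, this is roughly how such an atom is invoked directly; it assumes `CUTE_ARCH_MMA_SM90A_ENABLED` and that valid GMMA shared-memory descriptors `desc_a`/`desc_b` were built elsewhere (descriptor construction is elided).

```cpp
// Illustrative sketch, not the library's prescribed entry point.
#include <cute/arch/mma_sm90_gmma.hpp>

__device__ void s8u8_accumulate_64x128(uint64_t desc_a, uint64_t desc_b,
                                       uint32_t (&d)[64])
{
  // One 64x128x32 S32 += S8*U8 warpgroup MMA. ScaleOut::One keeps the existing
  // accumulator; ScaleOut::Zero clears it first -- that is exactly what the
  // `setp.ne.b32 p, ...` / trailing `p` predicate in the PTX above implements.
  cute::SM90_64x128x32_S32S8U8_SS_TN::fma(
      desc_a, desc_b,
      d[ 0], d[ 1], d[ 2], d[ 3], d[ 4], d[ 5], d[ 6], d[ 7],
      d[ 8], d[ 9], d[10], d[11], d[12], d[13], d[14], d[15],
      d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
      d[24], d[25], d[26], d[27], d[28], d[29], d[30], d[31],
      d[32], d[33], d[34], d[35], d[36], d[37], d[38], d[39],
      d[40], d[41], d[42], d[43], d[44], d[45], d[46], d[47],
      d[48], d[49], d[50], d[51], d[52], d[53], d[54], d[55],
      d[56], d[57], d[58], d[59], d[60], d[61], d[62], d[63],
      cute::GMMA::ScaleOut::One);
}
```

> In real kernels this call is issued between warpgroup MMA fences (`cute::warpgroup_arrive()` before, `cute::warpgroup_commit_batch()` and `cute::warpgroup_wait<0>()` after), since `wgmma.mma_async` is asynchronous; CUTLASS's collective mainloops handle that sequencing.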
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x192x32 TN S32+=S8*U8
-struct SM90_64x192x32_S32S8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[96];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
-      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
-      uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91,
-      uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %98, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95},"
-      " %96,"
-      " %97,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
"+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=S8*U8 -struct SM90_64x192x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), 
"+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=S8*U8 -struct SM90_64x208x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %106, 
0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=S8*U8 -struct SM90_64x208x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, 
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %106, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103},"
-      " %104,"
-      " %105,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x224x32 TN S32+=S8*U8
-struct SM90_64x224x32_S32S8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[112];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %114, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111},"
-      " %112,"
-      " %113,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
"+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=S8*U8 -struct SM90_64x224x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & 
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %114, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111},"
-      " %112,"
-      " %113,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
-        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
-        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x240x32 TN S32+=S8*U8
-struct SM90_64x240x32_S32S8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[120];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
-      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %122, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119},"
-      " %120,"
-      " %121,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
"+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=S8*U8 -struct SM90_64x240x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, 
%17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=S8*U8 -struct SM90_64x256x32_S32S8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, 
uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), 
"+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=S8*U8 -struct SM90_64x256x32_S32S8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - 
"wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=S8*U8 -struct SM90_64x8x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.u8 " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - 
"l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=S8*U8 -struct SM90_64x8x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=S8*U8 -struct SM90_64x16x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=S8*U8 -struct SM90_64x16x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=S8*U8 -struct SM90_64x32x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=S8*U8 -struct SM90_64x32x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN S32+=S8*U8 -struct SM90_64x48x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; - - CUTE_HOST_DEVICE static void - fma(uint32_t 
const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN S32+=S8*U8 -struct SM90_64x48x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=S8*U8 -struct SM90_64x64x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = 
uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=S8*U8 -struct SM90_64x64x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), 
"+r"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=S8*U8 -struct SM90_64x80x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=S8*U8 -struct SM90_64x80x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, 
uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x96x32 TN S32+=S8*U8 -struct SM90_64x96x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), 
"+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x96x32 TN S32+=S8*U8 -struct SM90_64x96x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN S32+=S8*U8 -struct SM90_64x112x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t 
const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN S32+=S8*U8 -struct SM90_64x112x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & 
d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=S8*U8 -struct SM90_64x128x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.u8 " - "{%0, %1, %2, %3, 
%4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=S8*U8 -struct SM90_64x128x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), 
"+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN S32+=S8*U8 -struct SM90_64x144x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[72]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), 
"+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN S32+=S8*U8 -struct SM90_64x144x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[72]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), 
"+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=S8*U8 -struct SM90_64x160x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - 
"+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=S8*U8 -struct SM90_64x160x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - 
"+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=S8*U8 -struct SM90_64x176x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, 
%46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=S8*U8 -struct SM90_64x176x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=S8*U8 -struct SM90_64x192x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & 
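// [Editor's sketch] Every struct in this deletion hunk follows the same recipe: fixed
// register-count typedefs plus an fma() that wraps a single wgmma.mma_async PTX
// instruction. Rather than calling fma() by hand, CuTe code normally consumes these
// ops through cute::MMA_Atom. A minimal sketch, assuming the CuTe headers are on the
// include path, using the 64x192 op defined just below:
//
//   #include <cute/atom/mma_atom.hpp>
//
//   using AtomRS = cute::MMA_Atom<cute::SM90_64x192x32_S32S8U8_RS_TN>;
//   // One atom computes a (M,N,K) = (64,192,32) tile: C(s32) += A(s8) * B(u8).
//   static_assert(cute::size<0>(typename AtomRS::Shape_MNK{}) ==  64, "M per atom");
//   static_assert(cute::size<1>(typename AtomRS::Shape_MNK{}) == 192, "N per atom");
//   static_assert(cute::size<2>(typename AtomRS::Shape_MNK{}) ==  32, "K per atom");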
d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=S8*U8 -struct SM90_64x192x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, 
uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=S8*U8 -struct SM90_64x208x32_S32S8U8_RS_TN 
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x208x32 TN S32+=S8*U8
-struct SM90_64x208x32_S32S8U8_RS_TN
-{
- using DRegisters = void;
- using ARegisters = uint32_t[4];
- using BRegisters = uint64_t[1];
- using CRegisters = uint32_t[104];
-
- CUTE_HOST_DEVICE static void
- fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
- uint64_t const& desc_b,
- uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
- uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
- uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
- uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
- uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
- uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
- uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
- uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
- uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
- uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
- uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
- uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
- uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
- uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
- uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
- uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
- uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
- uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
- uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
- uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
- uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
- uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
- uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
- uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
- uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
- uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
- GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
- {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
- asm volatile(
- "{\n"
- ".reg .pred p;\n"
- "setp.ne.b32 p, %109, 0;\n"
- "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.u8 "
- "{%0, %1, %2, %3, %4, %5, %6, %7, "
- " %8, %9, %10, %11, %12, %13, %14, %15, "
- " %16, %17, %18, %19, %20, %21, %22, %23, "
- " %24, %25, %26, %27, %28, %29, %30, %31, "
- " %32, %33, %34, %35, %36, %37, %38, %39, "
- " %40, %41, %42, %43, %44, %45, %46, %47, "
- " %48, %49, %50, %51, %52, %53, %54, %55, "
- " %56, %57, %58, %59, %60, %61, %62, %63, "
- " %64, %65, %66, %67, %68, %69, %70, %71, "
- " %72, %73, %74, %75, %76, %77, %78, %79, "
- " %80, %81, %82, %83, %84, %85, %86, %87, "
- " %88, %89, %90, %91, %92, %93, %94, %95, "
- " %96, %97, %98, %99, %100, %101, %102, %103},"
- "{%104, %105, %106, %107},"
- " %108,"
- " p;\n"
- "}\n"
- : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
- "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
- "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
- "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
- "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
- "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
- "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
- "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
- "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
- "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
- "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
- "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
- "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
- "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
- "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
- "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
- "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
- "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
- "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
- "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
- "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
- "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
- "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
- "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
- "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
- "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103)
- : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
- "l"(desc_b),
- "r"(int32_t(scale_D)));
-#else
- CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
- }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x208x32 TN S32+=S8*U8
-struct SM90_64x208x32_S32S8U8_RS_TN_SATURATE
-{
- using DRegisters = void;
- using ARegisters = uint32_t[4];
- using BRegisters = uint64_t[1];
- using CRegisters = uint32_t[104];
-
- CUTE_HOST_DEVICE static void
- fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
- uint64_t const& desc_b,
- uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
- uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
- uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
- uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
- uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
- uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
- uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
- uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
- uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
- uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
- uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
- uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
- uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
- uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
- uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
- uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
- uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
- uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
- uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
- uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
- uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
- uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
- uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
- uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
- uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
- uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
- GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
- {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
- asm volatile(
- "{\n"
- ".reg .pred p;\n"
- "setp.ne.b32 p, %109, 0;\n"
- "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.u8.satfinite "
- "{%0, %1, %2, %3, %4, %5, %6, %7, "
- " %8, %9, %10, %11, %12, %13, %14, %15, "
- " %16, %17, %18, %19, %20, %21, %22, %23, "
- " %24, %25, %26, %27, %28, %29, %30, %31, "
- " %32, %33, %34, %35, %36, %37, %38, %39, "
- " %40, %41, %42, %43, %44, %45, %46, %47, "
- " %48, %49, %50, %51, %52, %53, %54, %55, "
- " %56, %57, %58, %59, %60, %61, %62, %63, "
- " %64, %65, %66, %67, %68, %69, %70, %71, "
- " %72, %73, %74, %75, %76, %77, %78, %79, "
- " %80, %81, %82, %83, %84, %85, %86, %87, "
- " %88, %89, %90, %91, %92, %93, %94, %95, "
- " %96, %97, %98, %99, %100, %101, %102, %103},"
- "{%104, %105, %106, %107},"
- " %108,"
- " p;\n"
- "}\n"
- : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
- "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
- "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
- "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
- "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
- "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
- "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
- "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
- "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
- "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
- "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
- "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
- "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
- "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
- "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
- "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
- "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
- "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
- "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
- "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
- "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
- "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
- "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
- "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
- "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
- "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103)
- : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
- "l"(desc_b),
- "r"(int32_t(scale_D)));
-#else
- CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
- }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x224x32 TN S32+=S8*U8
-struct SM90_64x224x32_S32S8U8_RS_TN
-{
- using DRegisters = void;
- using ARegisters = uint32_t[4];
- using BRegisters = uint64_t[1];
- using CRegisters = uint32_t[112];
-
- CUTE_HOST_DEVICE static void
- fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
- uint64_t const& desc_b,
- uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
- uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
- uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
- uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
- uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
- uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
- uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
- uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
- uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
- uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
- uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
- uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
- uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
- uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
- uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
- uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
- uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
- uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
- uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
- uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
- uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
- uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
- uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
- uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
- uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
- uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
- uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
- uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
- GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
- {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
- asm volatile(
- "{\n"
- ".reg .pred p;\n"
- "setp.ne.b32 p, %117, 0;\n"
- "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.u8 "
- "{%0, %1, %2, %3, %4, %5, %6, %7, "
- " %8, %9, %10, %11, %12, %13, %14, %15, "
- " %16, %17, %18, %19, %20, %21, %22, %23, "
- " %24, %25, %26, %27, %28, %29, %30, %31, "
- " %32, %33, %34, %35, %36, %37, %38, %39, "
- " %40, %41, %42, %43, %44, %45, %46, %47, "
- " %48, %49, %50, %51, %52, %53, %54, %55, "
- " %56, %57, %58, %59, %60, %61, %62, %63, "
- " %64, %65, %66, %67, %68, %69, %70, %71, "
- " %72, %73, %74, %75, %76, %77, %78, %79, "
- " %80, %81, %82, %83, %84, %85, %86, %87, "
- " %88, %89, %90, %91, %92, %93, %94, %95, "
- " %96, %97, %98, %99, %100, %101, %102, %103, "
- " %104, %105, %106, %107, %108, %109, %110, %111},"
- "{%112, %113, %114, %115},"
- " %116,"
- " p;\n"
- "}\n"
- : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
- "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
- "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
- "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
- "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
- "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
- "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
- "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
- "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
- "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
- "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
- "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
- "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
- "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
- "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
- "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
- "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
- "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
- "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
- "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
- "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
- "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
- "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
- "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
- "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
- "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
- "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
- "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111)
- : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
- "l"(desc_b),
- "r"(int32_t(scale_D)));
-#else
- CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
- }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x224x32 TN S32+=S8*U8
-struct SM90_64x224x32_S32S8U8_RS_TN_SATURATE
-{
- using DRegisters = void;
- using ARegisters = uint32_t[4];
- using BRegisters = uint64_t[1];
- using CRegisters = uint32_t[112];
-
- CUTE_HOST_DEVICE static void
- fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
- uint64_t const& desc_b,
- uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
- uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
- uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
- uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
- uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
- uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
- uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
- uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
- uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
- uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
- uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
- uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
- uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
- uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
- uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
- uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
- uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
- uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
- uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
- uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
- uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
- uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
- uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
- uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
- uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
- uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
- uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
- uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
- GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
- {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
- asm volatile(
- "{\n"
- ".reg .pred p;\n"
- "setp.ne.b32 p, %117, 0;\n"
- "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.u8.satfinite "
- "{%0, %1, %2, %3, %4, %5, %6, %7, "
- " %8, %9, %10, %11, %12, %13, %14, %15, "
- " %16, %17, %18, %19, %20, %21, %22, %23, "
- " %24, %25, %26, %27, %28, %29, %30, %31, "
- " %32, %33, %34, %35, %36, %37, %38, %39, "
- " %40, %41, %42, %43, %44, %45, %46, %47, "
- " %48, %49, %50, %51, %52, %53, %54, %55, "
- " %56, %57, %58, %59, %60, %61, %62, %63, "
- " %64, %65, %66, %67, %68, %69, %70, %71, "
- " %72, %73, %74, %75, %76, %77, %78, %79, "
- " %80, %81, %82, %83, %84, %85, %86, %87, "
- " %88, %89, %90, %91, %92, %93, %94, %95, "
- " %96, %97, %98, %99, %100, %101, %102, %103, "
- " %104, %105, %106, %107, %108, %109, %110, %111},"
- "{%112, %113, %114, %115},"
- " %116,"
- " p;\n"
- "}\n"
- : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
- "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
- "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
- "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
- "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
- "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
- "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
- "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
- "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
- "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
- "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
- "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
- "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
- "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
- "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
- "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
- "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
- "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
- "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
- "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
- "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
- "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
- "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
- "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
- "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
- "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
- "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
- "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111)
- : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
- "l"(desc_b),
- "r"(int32_t(scale_D)));
-#else
- CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
- }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x240x32 TN S32+=S8*U8
-struct SM90_64x240x32_S32S8U8_RS_TN
-{
- using DRegisters = void;
- using ARegisters = uint32_t[4];
- using BRegisters = uint64_t[1];
- using CRegisters = uint32_t[120];
-
- CUTE_HOST_DEVICE static void
- fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
- uint64_t const& desc_b,
- uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
- uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
- uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
- uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
- uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
- uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
- uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
- uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
- uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
- uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
- uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
- uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
- uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
- uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
- uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
- uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
- uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
- uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
- uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
- uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
- uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
- uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
- uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
- uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
- uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
- uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
- uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
- uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
- uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
- uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
- GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
- {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
- asm volatile(
- "{\n"
- ".reg .pred p;\n"
- "setp.ne.b32 p, %125, 0;\n"
- "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.u8 "
- "{%0, %1, %2, %3, %4, %5, %6, %7, "
- " %8, %9, %10, %11, %12, %13, %14, %15, "
- " %16, %17, %18, %19, %20, %21, %22, %23, "
- " %24, %25, %26, %27, %28, %29, %30, %31, "
- " %32, %33, %34, %35, %36, %37, %38, %39, "
- " %40, %41, %42, %43, %44, %45, %46, %47, "
- " %48, %49, %50, %51, %52, %53, %54, %55, "
- " %56, %57, %58, %59, %60, %61, %62, %63, "
- " %64, %65, %66, %67, %68, %69, %70, %71, "
- " %72, %73, %74, %75, %76, %77, %78, %79, "
- " %80, %81, %82, %83, %84, %85, %86, %87, "
- " %88, %89, %90, %91, %92, %93, %94, %95, "
- " %96, %97, %98, %99, %100, %101, %102, %103, "
- " %104, %105, %106, %107, %108, %109, %110, %111, "
- " %112, %113, %114, %115, %116, %117, %118, %119},"
- "{%120, %121, %122, %123},"
- " %124,"
- " p;\n"
- "}\n"
- : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
- "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
- "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
- "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
- "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
- "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
- "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
- "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
- "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
- "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
- "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
- "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
- "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
- "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
- "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
- "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
- "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
- "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
- "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
- "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
- "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
- "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
- "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
- "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
- "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
- "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
- "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
- "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
- "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
- "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119)
- : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
- "l"(desc_b),
- "r"(int32_t(scale_D)));
-#else
- CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
- }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=S8*U8 -struct SM90_64x240x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), 
"+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=S8*U8 -struct SM90_64x256x32_S32S8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t 
& d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - "{%128, %129, %130, %131}," - " %132," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "r"(a000), "r"(a001), "r"(a002), 
"r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=S8*U8 -struct SM90_64x256x32_S32S8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, 
-// GMMA 64x8x32 TN S32+=U8*S8
-struct SM90_64x8x32_S32U8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[4];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %6, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.s8 "
-      "{%0, %1, %2, %3},"
-      " %4,"
-      " %5,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
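// These op structs are rarely invoked by hand; they are normally wrapped in a
// cute::MMA_Atom / TiledMMA, which supplies operands and partitioning. Still,
// a direct call is well-defined. A hedged sketch (desc_a/desc_b stand for
// shared-memory matrix descriptors built elsewhere, and the surrounding
// wgmma fence/commit/wait protocol is omitted here):
__device__ void gmma_64x8_once(uint64_t desc_a, uint64_t desc_b,
                               uint32_t (&d)[4]) {
  cute::SM90_64x8x32_S32U8S8_SS_TN::fma(
      desc_a, desc_b, d[0], d[1], d[2], d[3],
      cute::GMMA::ScaleOut::One);  // accumulate into the running s32 tile
}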
-// GMMA 64x8x32 TN S32+=U8*S8
-struct SM90_64x8x32_S32U8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[4];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %6, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3},"
-      " %4,"
-      " %5,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x32 TN S32+=U8*S8
-struct SM90_64x16x32_S32U8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[8];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %10, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7},"
-      " %8,"
-      " %9,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
-        "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x32 TN S32+=U8*S8
-struct SM90_64x16x32_S32U8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[8];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %10, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7},"
-      " %8,"
-      " %9,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
-        "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
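// The CRegisters extents above follow one rule: a 64xN s32 accumulator tile
// holds 64*N values spread across the 128 threads of one warpgroup, i.e.
// 64*N/128 = N/2 registers per thread (N=8 -> 4, N=16 -> 8, ..., N=256 -> 128).
// A compile-time statement of that relation (illustration only):
constexpr int gmma_s32_c_regs(int n) { return 64 * n / 128; }

static_assert(gmma_s32_c_regs(8)   == 4,   "matches uint32_t[4] above");
static_assert(gmma_s32_c_regs(16)  == 8,   "matches uint32_t[8] above");
static_assert(gmma_s32_c_regs(256) == 128, "matches uint32_t[128] above");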
-// GMMA 64x32x32 TN S32+=U8*S8
-struct SM90_64x32x32_S32U8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[16];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %18, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15},"
-      " %16,"
-      " %17,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x32x32 TN S32+=U8*S8
-struct SM90_64x32x32_S32U8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[16];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %18, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15},"
-      " %16,"
-      " %17,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
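// From this point on, the in-between N extents (48, 80, 112, 144, 160, 176)
// are guarded by CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED, so the long tail of
// shapes is opt-in rather than a compile-time cost for every build. A sketch
// of opting in (macro name taken from the guards below; define it before any
// CuTe include):
//
//   #define CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED
//   #include <cute/arch/mma_sm90.hpp>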
"+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN S32+=U8*S8 -struct SM90_64x48x32_S32U8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=U8*S8 -struct SM90_64x64x32_S32U8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - 
"+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=U8*S8 -struct SM90_64x64x32_S32U8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=U8*S8 -struct SM90_64x80x32_S32U8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if 
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN S32+=U8*S8
-struct SM90_64x80x32_S32U8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %42, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      " %40,"
-      " %41,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN S32+=U8*S8
-struct SM90_64x80x32_S32U8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %42, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      " %40,"
-      " %41,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=U8*S8
-struct SM90_64x96x32_S32U8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %50, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      " %48,"
-      " %49,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=U8*S8
-struct SM90_64x96x32_S32U8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %50, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      " %48,"
-      " %49,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
"+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN S32+=U8*S8 -struct SM90_64x112x32_S32U8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=U8*S8 -struct SM90_64x128x32_S32U8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = 
-// GMMA 64x128x32 TN S32+=U8*S8
-struct SM90_64x128x32_S32U8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[64];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %66, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      " %64,"
-      " %65,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
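// Every _SS_ op in this hunk reads both A and B through 64-bit shared-memory
// matrix descriptors, while the _RS_ ops near the top of the hunk source A
// from registers (ARegisters = uint32_t[4]). A sketch of what the descriptor
// packs, with a hypothetical field split (see cute/arch/mma_sm90_desc.hpp for
// the real GmmaDescriptor; widths here are for illustration only):
struct SmemDescSketch {
  uint64_t start_address : 14;  // smem byte address >> 4
  uint64_t lead_offset   : 14;  // leading-dimension byte offset >> 4
  uint64_t stride_offset : 14;  // stride byte offset >> 4
  uint64_t reserved      : 19;
  uint64_t swizzle       : 3;   // swizzle mode of the smem tile
};
static_assert(sizeof(SmemDescSketch) == 8, "packs into one uint64_t operand");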
-// GMMA 64x128x32 TN S32+=U8*S8
-struct SM90_64x128x32_S32U8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[64];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %66, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      " %64,"
-      " %65,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x32 TN S32+=U8*S8
-struct SM90_64x144x32_S32U8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[72];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %74, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      " %72,"
-      " %73,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x32 TN S32+=U8*S8
-struct SM90_64x144x32_S32U8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[72];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %74, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      " %72,"
-      " %73,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x32 TN S32+=U8*S8
-struct SM90_64x160x32_S32U8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[80];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %82, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      " %80,"
-      " %81,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
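// wgmma.mma_async is, as the name says, asynchronous: issued MMAs must be
// fenced and drained around uses of their register and smem operands. A
// hedged sketch of the surrounding protocol using CuTe's warpgroup helpers
// (exact placement varies by mainloop schedule):
__device__ void gmma_issue_and_drain(uint64_t desc_a, uint64_t desc_b,
                                     uint32_t (&d)[4]) {
  cute::warpgroup_arrive();       // wgmma.fence: operands now safe to read
  cute::SM90_64x8x32_S32U8S8_SS_TN::fma(
      desc_a, desc_b, d[0], d[1], d[2], d[3], cute::GMMA::ScaleOut::One);
  cute::warpgroup_commit_batch(); // wgmma.commit_group: close the batch
  cute::warpgroup_wait<0>();      // wgmma.wait_group 0: drain all batches
}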
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x32 TN S32+=U8*S8
-struct SM90_64x160x32_S32U8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[80];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %82, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      " %80,"
-      " %81,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x176x32 TN S32+=U8*S8
-struct SM90_64x176x32_S32U8S8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[88];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
-      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %90, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87},"
-      " %88,"
-      " %89,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
-        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
-        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x176x32 TN S32+=U8*S8
-struct SM90_64x176x32_S32U8S8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[88];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
-      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
-      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
-      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
-      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
-      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %90, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87},"
-      " %88,"
-      " %89,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
-        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
-        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
-        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
-        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
-        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
-        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
- uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=U8*S8 -struct SM90_64x192x32_S32U8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - 
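// Register convention for the SS atoms in this file, as the typedefs above
// indicate: A and B are consumed through 64-bit shared-memory matrix
// descriptors (ARegisters/BRegisters = uint64_t[1]) rather than register
// fragments, and the instruction accumulates in place, so D aliases C and
// DRegisters is void. The accumulator count follows from the tile shape:
// a 64x192 s32 tile holds 64 * 192 = 12288 values spread across the 128
// threads of a warpgroup, i.e. 12288 / 128 = 96 uint32_t per thread.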
CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - 
"+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=U8*S8 -struct SM90_64x208x32_S32U8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), 
"+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=U8*S8 -struct SM90_64x208x32_S32U8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - 
uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=U8*S8 -struct SM90_64x224x32_S32U8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, 
uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), 
"+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=U8*S8 -struct SM90_64x224x32_S32U8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, 
%107, %108, %109, %110, %111}," - " %112," - " %113," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=U8*S8 -struct SM90_64x240x32_S32U8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & 
d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=U8*S8 
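// The _SATURATE variant below is identical to SM90_64x240x32_S32U8S8_SS_TN
// except for the ".satfinite" qualifier, which clamps the s32 accumulators to
// the representable range on overflow instead of letting them wrap. In both
// variants the trailing predicate operand (%122 here: 120 accumulators plus
// the two descriptors) is set from scale_D via setp.ne.b32 and selects between
// accumulating into D (GMMA::ScaleOut::One) and overwriting D with A*B
// (GMMA::ScaleOut::Zero), so zeroing the accumulators can be folded into the
// first MMA of a k-loop.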
-struct SM90_64x240x32_S32U8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - 
"+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=U8*S8 -struct SM90_64x256x32_S32U8S8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, 
uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=U8*S8 
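// A usage note for this family: these structs are typically not invoked
// directly. CUTLASS wraps each one in a cute::MMA_Atom through a matching
// MMA_Traits specialization (defined alongside this header), which builds the
// shared-memory descriptors and partitions the accumulators; calling fma by
// hand would mean spelling out all 128 accumulator references. All of these
// wgmma wrappers require compiling for sm_90a (CUTE_ARCH_MMA_SM90A_ENABLED),
// and the non-standard widths (64x176, 64x208, 64x224, 64x240) are further
// gated behind CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED. The _SS_ atoms above
// read both A and B from shared memory; the _RS_ variants that follow take
// the A fragment from registers (ARegisters = uint32_t[4]) while B remains
// descriptor-based.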
-struct SM90_64x256x32_S32U8S8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p;\n" - "}\n" - : 
"+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=U8*S8 -struct SM90_64x8x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.s8 " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=U8*S8 -struct SM90_64x8x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 
0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=U8*S8 -struct SM90_64x16x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=U8*S8 -struct SM90_64x16x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=U8*S8 -struct SM90_64x32x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - 
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN S32+=U8*S8 -struct SM90_64x32x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN S32+=U8*S8 -struct SM90_64x48x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, 
%25, %26, %27}," - " %28," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN S32+=U8*S8 -struct SM90_64x48x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=U8*S8 -struct SM90_64x64x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %37, 
0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=U8*S8 -struct SM90_64x64x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=U8*S8 -struct SM90_64x80x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t 
& d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=U8*S8 -struct SM90_64x80x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), 
"+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x96x32 TN S32+=U8*S8 -struct SM90_64x96x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x96x32 TN S32+=U8*S8 -struct SM90_64x96x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, 
uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN S32+=U8*S8 -struct SM90_64x112x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, 
uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN S32+=U8*S8 -struct SM90_64x112x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, 
%57, %58, %59}," - " %60," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=U8*S8 -struct SM90_64x128x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - 
"+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=U8*S8 -struct SM90_64x128x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90_64x128x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN S32+=U8*S8 -struct SM90_64x144x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[72]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN S32+=U8*S8 -struct SM90_64x144x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[72]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=U8*S8 -struct SM90_64x160x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - 
"r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=U8*S8 -struct SM90_64x160x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), 
"+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=U8*S8 -struct SM90_64x176x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), 
"+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=U8*S8 -struct SM90_64x176x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p;\n" - "}\n" - 
: "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=U8*S8 -struct SM90_64x192x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.s8 " - "{%0, %1, %2, %3, %4, %5, %6, 
%7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=U8*S8 -struct SM90_64x192x32_S32U8S8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - 
uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.s8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=U8*S8 -struct SM90_64x208x32_S32U8S8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, 
uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %109, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103},"
-      "{%104, %105, %106, %107},"
-      " %108,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x208x32 TN S32+=U8*S8
-struct SM90_64x208x32_S32U8S8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[104];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %109, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103},"
-      "{%104, %105, %106, %107},"
-      " %108,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x224x32 TN S32+=U8*S8
-struct SM90_64x224x32_S32U8S8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[112];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %117, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111},"
-      "{%112, %113, %114, %115},"
-      " %116,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
-        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
-        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x224x32 TN S32+=U8*S8
-struct SM90_64x224x32_S32U8S8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[112];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %117, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111},"
-      "{%112, %113, %114, %115},"
-      " %116,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
-        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
-        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x240x32 TN S32+=U8*S8
-struct SM90_64x240x32_S32U8S8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[120];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
-      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %125, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119},"
-      "{%120, %121, %122, %123},"
-      " %124,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
-        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
-        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
-        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
-        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x240x32 TN S32+=U8*S8
-struct SM90_64x240x32_S32U8S8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[120];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
-      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %125, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119},"
-      "{%120, %121, %122, %123},"
-      " %124,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
-        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
-        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
-        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
-        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x256x32 TN S32+=U8*S8
-struct SM90_64x256x32_S32U8S8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[128];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
-      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
-      uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123,
-      uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %133, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.s8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119, "
-      " %120, %121, %122, %123, %124, %125, %126, %127},"
-      "{%128, %129, %130, %131},"
-      " %132,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
-        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
-        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
-        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
-        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119),
-        "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123),
-        "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x256x32 TN S32+=U8*S8
-struct SM90_64x256x32_S32U8S8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[128];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
-      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
-      uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123,
-      uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %133, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.s8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119, "
-      " %120, %121, %122, %123, %124, %125, %126, %127},"
-      "{%128, %129, %130, %131},"
-      " %132,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
-        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
-        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
-        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
-        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119),
-        "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123),
-        "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x8x32 TN S32+=U8*U8
-struct SM90_64x8x32_S32U8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[4];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %6, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8 "
-      "{%0, %1, %2, %3},"
-      " %4,"
-      " %5,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x8x32 TN S32+=U8*U8
-struct SM90_64x8x32_S32U8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[4];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %6, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3},"
-      " %4,"
-      " %5,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x32 TN S32+=U8*U8
-struct SM90_64x16x32_S32U8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[8];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %10, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7},"
-      " %8,"
-      " %9,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
-        "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x32 TN S32+=U8*U8
-struct SM90_64x16x32_S32U8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[8];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %10, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7},"
-      " %8,"
-      " %9,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
-        "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x32x32 TN S32+=U8*U8
-struct SM90_64x32x32_S32U8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[16];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %18, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15},"
-      " %16,"
-      " %17,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
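[Editor's note, not part of the diff: one detail that repeats in every atom above and below. The `.reg .pred p;` / `setp.ne.b32 p, %N, 0;` preamble feeds the trailing `p` operand, which is wgmma's scale-d input: with `GMMA::ScaleOut::One` the instruction accumulates into D, with `GMMA::ScaleOut::Zero` it overwrites D. A scalar model, illustrative only:]

// Per accumulator element, wgmma's scale-d predicate behaves roughly as:
// d = (scale_D == GMMA::ScaleOut::One) ? a * b + d : a * b;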
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x32x32 TN S32+=U8*U8
-struct SM90_64x32x32_S32U8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[16];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %18, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15},"
-      " %16,"
-      " %17,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN S32+=U8*U8
-struct SM90_64x48x32_S32U8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %26, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      " %24,"
-      " %25,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN S32+=U8*U8
-struct SM90_64x48x32_S32U8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %26, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      " %24,"
-      " %25,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x64x32 TN S32+=U8*U8
-struct SM90_64x64x32_S32U8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[32];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %34, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31},"
-      " %32,"
-      " %33,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x64x32 TN S32+=U8*U8
-struct SM90_64x64x32_S32U8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[32];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %34, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31},"
-      " %32,"
-      " %33,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN S32+=U8*U8
-struct SM90_64x80x32_S32U8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %42, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      " %40,"
-      " %41,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN S32+=U8*U8
-struct SM90_64x80x32_S32U8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %42, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      " %40,"
-      " %41,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=U8*U8
-struct SM90_64x96x32_S32U8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %50, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      " %48,"
-      " %49,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=U8*U8
-struct SM90_64x96x32_S32U8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %50, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      " %48,"
-      " %49,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x112x32 TN S32+=U8*U8
-struct SM90_64x112x32_S32U8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[56];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %58, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55},"
-      " %56,"
-      " %57,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x112x32 TN S32+=U8*U8
-struct SM90_64x112x32_S32U8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[56];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %58, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55},"
-      " %56,"
-      " %57,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x128x32 TN S32+=U8*U8
-struct SM90_64x128x32_S32U8U8_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[64];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t &
d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=U8*U8 -struct SM90_64x128x32_S32U8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN S32+=U8*U8 -struct SM90_64x144x32_S32U8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[72]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %74, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, 
%26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - " %72," - " %73," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN S32+=U8*U8 -struct SM90_64x144x32_S32U8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[72]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %74, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, 
%60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - " %72," - " %73," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=U8*U8 -struct SM90_64x160x32_S32U8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %82, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " 
%72, %73, %74, %75, %76, %77, %78, %79}," - " %80," - " %81," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=U8*U8 -struct SM90_64x160x32_S32U8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %82, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, 
%59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - " %80," - " %81," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=U8*U8 -struct SM90_64x176x32_S32U8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %90, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " 
%24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - " %88," - " %89," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=U8*U8 -struct SM90_64x176x32_S32U8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const 
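// Of the tile widths removed in this hunk, only N = 96, 128, and 192 are
// built unconditionally; the less common widths (80, 112, 144, 160, 176,
// 208, 224, 240) sit behind CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED and are
// compiled only when that macro is defined.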
scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %90, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - " %88," - " %89," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=U8*U8 -struct SM90_64x192x32_S32U8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t 
& d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=U8*U8 -struct SM90_64x192x32_S32U8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, 
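// The _SATURATE variants are identical to their base atoms except for the
// ".satfinite" qualifier on the wgmma instruction, which clamps the s32
// accumulator to its representable range on overflow instead of letting it
// wrap.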
uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=U8*U8 -struct SM90_64x208x32_S32U8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, 
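// How scale_D works: the inline asm computes p = (scale_D != 0) with setp
// and passes p as the trailing scale-d predicate of wgmma.mma_async, so the
// instruction evaluates D = A*B + (p ? D : 0). Passing GMMA::ScaleOut::Zero
// on the first k-iteration therefore initializes the accumulator tile
// without a separate zero-fill pass.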
uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), 
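// A usage sketch (illustrative only; everything here other than the op name
// is the generic cute:: API, not part of this file): these ops are normally
// consumed through their MMA_Traits specialization via a tiled MMA, e.g.
//
//   auto tiled_mma = cute::make_tiled_mma(SM90_64x208x32_S32U8U8_SS_TN{});
//
// and the shared-memory descriptors desc_a/desc_b are constructed by the
// collective mainloop rather than by callers invoking fma() by hand.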
"+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=U8*U8 -struct SM90_64x208x32_S32U8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, 
%74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=U8*U8 -struct SM90_64x224x32_S32U8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t 
& d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=U8*U8 -struct SM90_64x224x32_S32U8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t 
const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - 
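// Both compile paths of the guard are uniform across the file: the asm body
// is emitted only when CUTE_ARCH_MMA_SM90A_ENABLED is defined (i.e. when
// compiling for the sm_90a target), and otherwise the call collapses to
// CUTE_INVALID_CONTROL_PATH, which traps with the quoted message rather
// than silently doing nothing.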
"+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=U8*U8 -struct SM90_64x240x32_S32U8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = 
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %122, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119},"
-      " %120,"
-      " %121,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
-        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
-        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
-        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
-        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
-        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
-        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
-        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
-        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
-        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
-        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
-        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x240x32 TN S32+=U8*U8
-struct SM90_64x240x32_S32U8U8_SS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[120];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
-      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
-      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
-      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
-      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
-      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
-      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
-      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %122, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119},"
-      " %120,"
-      " %121,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
-        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
-        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
-        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
-        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
"+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=U8*U8 -struct SM90_64x256x32_S32U8U8_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - 
"{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=U8*U8 -struct SM90_64x256x32_S32U8U8_SS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, 
-      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
-      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
-      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
-      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
-      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
-      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
-      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
-      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
-      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
-      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
-      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
-      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
-      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
-      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
-      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
-      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
-      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
-      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
-      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
-      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
-      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
-      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
-      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
-      uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123,
-      uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %130, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119, "
-      " %120, %121, %122, %123, %124, %125, %126, %127},"
-      " %128,"
-      " %129,"
-      " p;\n"
-    "}\n"
-      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
-        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
-        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
-        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
-        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
-        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
-        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
-        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
-        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
-        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
-        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
-        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
-        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
-        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
-        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
"+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=U8*U8 -struct SM90_64x8x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8 " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN S32+=U8*U8 -struct SM90_64x8x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN S32+=U8*U8 -struct SM90_64x16x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = 
-// GMMA 64x16x32 TN S32+=U8*U8
-struct SM90_64x16x32_S32U8U8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[8];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %13, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7},"
-      "{%8, %9, %10, %11},"
-      " %12,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
-        "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7)
-      : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x32 TN S32+=U8*U8
-struct SM90_64x16x32_S32U8U8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[8];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %13, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7},"
-      "{%8, %9, %10, %11},"
-      " %12,"
-      " p;\n"
-    "}\n"
-      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
-        "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7)
-      : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x32x32 TN S32+=U8*U8
-struct SM90_64x32x32_S32U8U8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[16];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %21, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15},"
-      "{%16, %17, %18, %19},"
-      " %20,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x32x32 TN S32+=U8*U8
-struct SM90_64x32x32_S32U8U8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[16];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %21, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15},"
-      "{%16, %17, %18, %19},"
-      " %20,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN S32+=U8*U8
-struct SM90_64x48x32_S32U8U8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %29, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      "{%24, %25, %26, %27},"
-      " %28,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
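// Atoms with the less common N extents (48, 80, 112, 144, and the 224/240
// shapes earlier in this file) are guarded by
// CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED and only compile when that macro is
// defined, which keeps the default build lean. To opt in, define the macro
// before including the header (or pass it as a -D flag to nvcc):
//
//   #define CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED
//   #include <cute/arch/mma_sm90_gmma.hpp>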
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN S32+=U8*U8
-struct SM90_64x48x32_S32U8U8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %29, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      "{%24, %25, %26, %27},"
-      " %28,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x64x32 TN S32+=U8*U8
-struct SM90_64x64x32_S32U8U8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[32];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %37, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31},"
-      "{%32, %33, %34, %35},"
-      " %36,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31)
"r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x64x32 TN S32+=U8*U8 -struct SM90_64x64x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN S32+=U8*U8 -struct SM90_64x80x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm 
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %45, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      "{%40, %41, %42, %43},"
-      " %44,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN S32+=U8*U8
-struct SM90_64x80x32_S32U8U8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %45, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      "{%40, %41, %42, %43},"
-      " %44,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
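// Every fma() in this file threads scale_D into the PTX as predicate p via
// "setp.ne.b32": ScaleOut::One computes D = A*B + D, while ScaleOut::Zero
// computes D = A*B, so the first k-block can initialize the accumulators
// without a separate clear. A hedged mainloop sketch using the small 64x8
// atom (the a_frag/desc_b layouts are assumptions for illustration):
//
//   #include <cute/arch/mma_sm90_gmma.hpp>
//
//   CUTE_DEVICE void
//   gemm_k_loop(uint32_t const* a_frag,  // 4 packed-u8 words per k-block (assumed)
//               uint64_t const* desc_b,  // one B descriptor per k-block (assumed)
//               int k_tiles,
//               uint32_t& d0, uint32_t& d1, uint32_t& d2, uint32_t& d3)
//   {
//     auto scale = cute::GMMA::ScaleOut::Zero;   // first wgmma overwrites D
//     for (int k = 0; k < k_tiles; ++k) {
//       cute::warpgroup_arrive();
//       cute::SM90_64x8x32_S32U8U8_RS_TN::fma(a_frag[4*k+0], a_frag[4*k+1],
//                                             a_frag[4*k+2], a_frag[4*k+3],
//                                             desc_b[k],
//                                             d0, d1, d2, d3, scale);
//       cute::warpgroup_commit_batch();
//       scale = cute::GMMA::ScaleOut::One;       // later k-blocks accumulate
//     }
//     cute::warpgroup_wait<0>();
//   }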
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=U8*U8
-struct SM90_64x96x32_S32U8U8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %53, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.u8 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      "{%48, %49, %50, %51},"
-      " %52,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x96x32 TN S32+=U8*U8
-struct SM90_64x96x32_S32U8U8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
& d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN S32+=U8*U8 -struct SM90_64x112x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, 
-      "{%56, %57, %58, %59},"
-      " %60,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x112x32 TN S32+=U8*U8
-struct SM90_64x112x32_S32U8U8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[56];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %61, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55},"
-      "{%56, %57, %58, %59},"
-      " %60,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
"+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x128x32 TN S32+=U8*U8 -struct SM90_64x128x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - 
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x128x32 TN S32+=U8*U8
-struct SM90_64x128x32_S32U8U8_RS_TN_SATURATE
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[64];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
-      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
-      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
-      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
-      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
-      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
-      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %69, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.u8.satfinite "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      "{%64, %65, %66, %67},"
-      " %68,"
-      " p;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
-        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
-        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
-        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
-        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x32 TN S32+=U8*U8
-struct SM90_64x144x32_S32U8U8_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[72];
-
-  CUTE_HOST_DEVICE static void
const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN S32+=U8*U8 -struct SM90_64x144x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[72]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, 
- uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=U8*U8 -struct SM90_64x160x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, 
uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN S32+=U8*U8 -struct SM90_64x160x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[80]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, 
uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=U8*U8 -struct SM90_64x176x32_S32U8U8_RS_TN -{ - using 
DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "r"(a00), "r"(a01), "r"(a02), 
"r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN S32+=U8*U8 -struct SM90_64x176x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[88]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), 
"+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=U8*U8 -struct SM90_64x192x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p;\n" - "}\n" - : "+r"(d00), 
"+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x192x32 TN S32+=U8*U8 -struct SM90_64x192x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[96]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, - uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, - uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, - uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, - uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, - uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, - uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, - uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, - uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, - uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - 
"wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), - "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), - "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), - "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), - "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), - "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), - "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), - "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), - "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=U8*U8 -struct SM90_64x208x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t 
& d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN S32+=U8*U8 -struct SM90_64x208x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = 
uint32_t[104]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), 
"+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=U8*U8 -struct SM90_64x224x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" - 
"wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN S32+=U8*U8 -struct SM90_64x224x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[112]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - 
uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), 
"+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=U8*U8 -struct SM90_64x240x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, 
%61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN S32+=U8*U8 -struct SM90_64x240x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[120]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, 
uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), 
"+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=U8*U8 -struct SM90_64x256x32_S32U8U8_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.u8 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, 
%37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - "{%128, %129, %130, %131}," - " %132," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x256x32 TN S32+=U8*U8 -struct SM90_64x256x32_S32U8U8_RS_TN_SATURATE -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[128]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, - uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, - uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, - uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, - uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, - uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, - uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, - uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, - uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, - uint32_t & 
d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, - uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, - uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, - uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, - uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, - uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, - uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, - uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, - uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, - uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, - uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, - uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, - uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, - uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, - uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, - uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, - uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, - uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, - uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, - uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, - uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, - uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, - uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.u8.satfinite " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119, " - " %120, %121, %122, %123, %124, %125, %126, %127}," - "{%128, %129, %130, %131}," - " %132," - " p;\n" - "}\n" - : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), - "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), - "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), - "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), - "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), - "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), - "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), - "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), - "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), - "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), - "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), - "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), - "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), - "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), - "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), - "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), - "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), - "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), - 
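// --------------------------------------------------------------------------
// [Editorial note, added commentary - not part of the original diff]
// Every atom in this file follows the same inline-asm operand convention,
// which is the easiest way to audit these blocks: the accumulator outputs
// ("+r"/"+f") occupy %0..%(n-1); the A operand follows (one 64-bit smem
// descriptor for *_SS_TN atoms, four 32-bit registers for *_RS_TN atoms);
// then the B descriptor; and finally scale_D, so the predicate setup
// "setp.ne.b32 p, %K, 0;" always names the scale_D input. A sketch for a
// hypothetical 4-accumulator RS atom: %0-%3 = D, %4-%7 = A, %8 = desc_b,
// %9 = scale_D.
// --------------------------------------------------------------------------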
"+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), - "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), - "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), - "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), - "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), - "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), - "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), - "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), - "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), - "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), - "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), - "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), - "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), - "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), - "l"(desc_b), - "r"(int32_t(scale_D))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x32_F16E4M3E4M3_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[2]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %4, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f16.e4m3.e4m3 " - "{%0, %1}," - " %2," - " %3," - " p, %5, %6;\n" - "}\n" - : "+r"(d0), "+r"(d1) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x32_F16E4M3E4M3_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[2]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %7, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f16.e4m3.e4m3 " - "{%0, %1}," - "{%2, %3, %4, %5}," - " %6," - " p, %8, %9;\n" - "}\n" - : "+r"(d0), "+r"(d1) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x32_F32E4M3E4M3_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - 
using BRegisters = uint64_t[1];
-  using CRegisters = float[4];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d0, float & d1, float & d2, float & d3,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %6, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e4m3 "
-      "{%0, %1, %2, %3},"
-      " %4,"
-      " %5,"
-      " p, %7, %8;\n"
-    "}\n"
-    : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x8x32 TN F32+=E4M3*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x8x32_F32E4M3E4M3_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[4];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
-      uint64_t const& desc_b,
-      float & d0, float & d1, float & d2, float & d3,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %9, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e4m3 "
-      "{%0, %1, %2, %3},"
-      "{%4, %5, %6, %7},"
-      " %8,"
-      " p, %10, %11;\n"
-    "}\n"
-    : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
-    : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x32 TN F16+=E4M3*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x16x32_F16E4M3E4M3_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[4];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %6, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.f16.e4m3.e4m3 "
-      "{%0, %1, %2, %3},"
-      " %4,"
-      " %5,"
-      " p, %7, %8;\n"
-    "}\n"
-    : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x32 TN F16+=E4M3*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x16x32_F16E4M3E4M3_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[4];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
-      uint64_t const& desc_b,
-      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %9, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.f16.e4m3.e4m3 "
-      "{%0, %1, %2, %3},"
-      "{%4, %5, %6, %7},"
-      " %8,"
-      " p, %10, %11;\n"
-    "}\n"
-    : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3)
-    : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x32 TN F32+=E4M3*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x16x32_F32E4M3E4M3_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[8];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d0, float & d1, float & d2, float & d3,
-      float & d4, float & d5, float & d6, float & d7,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %10, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7},"
-      " %8,"
-      " %9,"
-      " p, %11, %12;\n"
-    "}\n"
-    : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3),
-      "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7)
-    : "l"(desc_a),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x16x32 TN F32+=E4M3*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x16x32_F32E4M3E4M3_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[8];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
-      uint64_t const& desc_b,
-      float & d0, float & d1, float & d2, float & d3,
-      float & d4, float & d5, float & d6, float & d7,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %13, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7},"
-      "{%8, %9, %10, %11},"
-      " %12,"
-      " p, %14, %15;\n"
-    "}\n"
-    : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3),
-      "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7)
-    : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
-      "l"(desc_b),
-      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// 
GMMA 64x32x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x32_F16E4M3E4M3_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - " %8," - " %9," - " p, %11, %12;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x32_F16E4M3E4M3_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p, %14, %15;\n" - "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x32_F32E4M3E4M3_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[16]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, 
%14, %15}," - " %16," - " %17," - " p, %19, %20;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x32x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x32_F32E4M3E4M3_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[16]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p, %22, %23;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F16E4M3E4M3_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[12]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %14, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11}," - " %12," - " %13," - " p, %15, %16;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F16E4M3E4M3_SS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F16E4M3E4M3_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[12]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %17, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11}," - "{%12, %13, %14, %15}," - " %16," - " p, %18, %19;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F32E4M3E4M3_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[24]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p, %27, %28;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - 
"+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F32E4M3E4M3_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[24]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p, %30, %31;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x32_F16E4M3E4M3_SS_TN +// GMMA 64x192x32 TN S32+=S8*S8 +struct MMA_64x192x32_S32S8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = 
uint32_t[16]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -31784,499 +5184,619 @@ struct SM90_64x64x32_F16E4M3E4M3_SS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - " %16," - " %17," - " p, %19, %20;\n" + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), 
"+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x32_F16E4M3E4M3_RS_TN +// GMMA 64x192x32 TN S32+=S8*S8 +struct MMA_64x192x32_S32S8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p, %22, %23;\n" + "setp.ne.b32 p, %98, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x32_F32E4M3E4M3_SS_TN +// GMMA 64x256x32 TN S32+=S8*S8 +struct MMA_64x256x32_S32S8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[32]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, 
uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p, %35, %36;\n" + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), 
"+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x32_F32E4M3E4M3_RS_TN +// GMMA 64x256x32 TN S32+=S8*S8 +struct MMA_64x256x32_S32S8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[32]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & 
d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p, %38, %39;\n" + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - 
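// [Editorial note, added commentary - not part of the original diff]
// Illustrative sketch of GMMA::ScaleOut, which maps onto the wgmma scale-d
// predicate; variable names are hypothetical, and real kernels drive these
// atoms through cute::MMA_Atom with the required warpgroup fence/commit
// synchronization rather than calling fma() directly:
//
//   uint32_t d0 = 0, d1 = 0, d2 = 0, d3 = 0;
//   MMA_64x8x32_S32S8S8_RS_TN::fma(a0, a1, a2, a3, desc_b,
//                                  d0, d1, d2, d3,
//                                  GMMA::ScaleOut::Zero);  // D  = A*B
//   MMA_64x8x32_S32S8S8_RS_TN::fma(a0, a1, a2, a3, desc_b,
//                                  d0, d1, d2, d3,
//                                  GMMA::ScaleOut::One);   // D += A*B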
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x32_F16E4M3E4M3_SS_TN +// GMMA 64x8x32 TN S32+=S8*S8 +struct MMA_64x8x32_S32S8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[20]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg 
.pred p;\n" - "setp.ne.b32 p, %22, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19}," - " %20," - " %21," - " p, %23, %24;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.s8 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) - : "l"(desc_a), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x32_F16E4M3E4M3_RS_TN +// GMMA 64x8x32 TN S32+=S8*S8 +struct MMA_64x8x32_S32S8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[20]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %25, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19}," - "{%20, %21, %22, %23}," - " %24," - " p, %26, %27;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use 
MMA_64x8x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x32_F32E4M3E4M3_SS_TN +// GMMA 64x16x32 TN S32+=S8*S8 +struct MMA_64x16x32_S32S8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[40]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p, %43, %44;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) - : "l"(desc_a), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x32_F32E4M3E4M3_RS_TN +// GMMA 64x16x32 
TN S32+=S8*S8 +struct MMA_64x16x32_S32S8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[40]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p, %46, %47;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x32_F16E4M3E4M3_SS_TN +// GMMA 64x32x32 TN S32+=S8*S8 +struct MMA_64x32x32_S32S8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, 
uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p, %27, %28;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "l"(desc_a), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x32_F16E4M3E4M3_RS_TN +// GMMA 64x32x32 TN S32+=S8*S8 +struct MMA_64x32x32_S32S8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -32285,191 +5805,156 @@ struct SM90_64x96x32_F16E4M3E4M3_RS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.s8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p, %30, %31;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), 
"+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x32_F32E4M3E4M3_SS_TN +// GMMA 64x64x32 TN S32+=S8*S8 +struct MMA_64x64x32_S32S8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[48]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p, %51, %52;\n" + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), 
"+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) - : "l"(desc_a), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x32_F32E4M3E4M3_RS_TN +// GMMA 64x64x32 TN S32+=S8*S8 +struct MMA_64x64x32_S32S8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[48]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p, %54, %55;\n" + "setp.ne.b32 p, %37, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x32_F16E4M3E4M3_SS_TN +// GMMA 64x96x32 TN S32+=S8*S8 +struct MMA_64x96x32_S32S8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -32478,21 +5963,29 @@ struct SM90_64x112x32_F16E4M3E4M3_SS_TN uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %30, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - " %28," 
- " %29," - " p, %31, %32;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -32500,31 +5993,30 @@ struct SM90_64x112x32_F16E4M3E4M3_SS_TN "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) - : "l"(desc_a), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x32_F16E4M3E4M3_RS_TN +// GMMA 64x96x32 TN S32+=S8*S8 +struct MMA_64x96x32_S32S8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -32536,21 +6028,29 @@ struct SM90_64x112x32_F16E4M3E4M3_RS_TN uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %33, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - "{%28, %29, %30, %31}," - " %32," - " p, %34, %35;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " 
%40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -32558,183 +6058,183 @@ struct SM90_64x112x32_F16E4M3E4M3_RS_TN "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x32_F32E4M3E4M3_SS_TN +// GMMA 64x128x32 TN S32+=S8*S8 +struct MMA_64x128x32_S32S8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & 
d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) - : "l"(desc_a), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x32_F32E4M3E4M3_RS_TN +// GMMA 64x128x32 TN S32+=S8*S8 +struct MMA_64x128x32_S32S8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d00, float & 
d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.s8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + 
"+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x32_F16E4M3E4M3_SS_TN +// GMMA 64x192x32 TN S32+=S8*S8 +struct MMA_64x192x32_S32S8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -32744,21 +6244,46 @@ struct SM90_64x128x32_F16E4M3E4M3_SS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," 
- " %33," - " p, %35, %36;\n" + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -32767,29 +6292,41 @@ struct SM90_64x128x32_F16E4M3E4M3_SS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "l"(desc_a), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x32_F16E4M3E4M3_RS_TN +// GMMA 64x192x32 TN S32+=S8*S8 +struct MMA_64x192x32_S32S8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -32802,21 +6339,46 @@ struct SM90_64x128x32_F16E4M3E4M3_RS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, 
uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p, %38, %39;\n" + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -32825,57 +6387,86 @@ struct SM90_64x128x32_F16E4M3E4M3_RS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x32_F32E4M3E4M3_SS_TN +// GMMA 64x256x32 TN S32+=S8*S8 +struct MMA_64x256x32_S32S8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[64]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, 
uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -32883,77 +6474,114 @@ struct SM90_64x128x32_F32E4M3E4M3_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p, %67, %68;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) - : "l"(desc_a), + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), 
"+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x32_F32E4M3E4M3_RS_TN +// GMMA 64x256x32 TN S32+=S8*S8 +struct MMA_64x256x32_S32S8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[64]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t 
& d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.s8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -32961,346 +6589,225 @@ struct SM90_64x128x32_F32E4M3E4M3_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p, %70, %71;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), 
"+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x144x32_F16E4M3E4M3_SS_TN +// GMMA 64x8x32 TN S32+=S8*U8 +struct MMA_64x8x32_S32S8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %38, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35}," - " %36," - " %37," - " p, %39, %40;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.u8 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), 
"+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x144x32_F16E4M3E4M3_RS_TN +// GMMA 64x8x32 TN S32+=S8*U8 +struct MMA_64x8x32_S32S8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %41, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35}," - "{%36, %37, %38, %39}," - " %40," - " p, %42, %43;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F16E4M3E4M3_RS_TN 
without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x144x32_F32E4M3E4M3_SS_TN +// GMMA 64x16x32 TN S32+=S8*U8 +struct MMA_64x16x32_S32S8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[72]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %74, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - " %72," - " %73," - " p, %75, %76;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - 
"+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x144x32_F32E4M3E4M3_RS_TN +// GMMA 64x16x32 TN S32+=S8*U8 +struct MMA_64x16x32_S32S8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[72]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p, %78, %79;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), 
"+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x160x32_F16E4M3E4M3_SS_TN +// GMMA 64x32x32 TN S32+=S8*U8 +struct MMA_64x32x32_S32S8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -33309,308 +6816,198 @@ struct SM90_64x160x32_F16E4M3E4M3_SS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p, %43, %44;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - 
"+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x160x32_F16E4M3E4M3_RS_TN +// GMMA 64x32x32 TN S32+=S8*U8 +struct MMA_64x32x32_S32S8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p, %46, %47;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32S8U8_SS_TN_SATURATE without 
CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x160x32_F32E4M3E4M3_SS_TN +// GMMA 64x64x32 TN S32+=S8*U8 +struct MMA_64x64x32_S32S8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[80]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %82, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - " %80," - " %81," - " p, %83, %84;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), 
"+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x160x32_F32E4M3E4M3_RS_TN +// GMMA 64x64x32 TN S32+=S8*U8 +struct MMA_64x64x32_S32S8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[80]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + 
uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p, %86, %87;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x176x32_F16E4M3E4M3_SS_TN +// GMMA 64x96x32 TN S32+=S8*U8 +struct 
MMA_64x96x32_S32S8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -33626,23 +7023,25 @@ struct SM90_64x176x32_F16E4M3E4M3_SS_TN uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %46, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43}," - " %44," - " %45," - " p, %47, %48;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -33654,34 +7053,29 @@ struct SM90_64x176x32_F16E4M3E4M3_SS_TN "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x176x32_F16E4M3E4M3_RS_TN +// GMMA 64x96x32 TN S32+=S8*U8 +struct MMA_64x96x32_S32S8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -33694,23 +7088,25 @@ struct SM90_64x176x32_F16E4M3E4M3_RS_TN uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - 
"setp.ne.b32 p, %49, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43}," - "{%44, %45, %46, %47}," + " %40, %41, %42, %43, %44, %45, %46, %47}," " %48," - " p, %50, %51;\n" + " %49," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -33722,65 +7118,55 @@ struct SM90_64x176x32_F16E4M3E4M3_RS_TN "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x176x32_F32E4M3E4M3_SS_TN +// GMMA 64x128x32 TN S32+=S8*U8 +struct MMA_64x128x32_S32S8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[88]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t 
& d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %90, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -33788,94 +7174,74 @@ struct SM90_64x176x32_F32E4M3E4M3_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - " %88," - " %89," - " p, %91, %92;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), 
"n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x176x32_F32E4M3E4M3_RS_TN +// GMMA 64x128x32 TN S32+=S8*U8 +struct MMA_64x128x32_S32S8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[88]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred 
p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -33883,59 +7249,45 @@ struct SM90_64x176x32_F32E4M3E4M3_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p, %94, %95;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x192x32_F16E4M3E4M3_SS_TN +// GMMA 64x192x32 TN S32+=S8*U8 +struct MMA_64x192x32_S32S8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static 
void fma(uint64_t const& desc_a, @@ -33952,23 +7304,42 @@ struct SM90_64x192x32_F16E4M3E4M3_SS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p, %51, %52;\n" + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -33981,32 +7352,40 @@ struct SM90_64x192x32_F16E4M3E4M3_SS_TN "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x192x32_F16E4M3E4M3_RS_TN +// GMMA 64x192x32 TN S32+=S8*U8 +struct MMA_64x192x32_S32S8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -34020,23 +7399,42 @@ struct SM90_64x192x32_F16E4M3E4M3_RS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p, %54, %55;\n" + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -34049,65 +7447,82 @@ struct SM90_64x192x32_F16E4M3E4M3_RS_TN "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), 
"+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x192x32_F32E4M3E4M3_SS_TN +// GMMA 64x256x32 TN S32+=S8*U8 +struct MMA_64x256x32_S32S8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[96]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - float & d88, float & d89, float & d90, float & d91, - float & d92, float & d93, float & d94, float & d95, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & 
d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -34119,93 +7534,110 @@ struct SM90_64x192x32_F32E4M3E4M3_SS_TN " %64, %65, %66, %67, %68, %69, %70, %71, " " %72, %73, %74, %75, %76, %77, %78, %79, " " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " p, %99, %100;\n" + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), 
"+f"(d86), "+f"(d87), - "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), - "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x192x32_F32E4M3E4M3_RS_TN +// GMMA 64x256x32 TN S32+=S8*U8 +struct MMA_64x256x32_S32S8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[96]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float 
& d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - float & d88, float & d89, float & d90, float & d91, - float & d92, float & d93, float & d94, float & d95, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -34217,489 +7649,266 @@ struct SM90_64x192x32_F32E4M3E4M3_RS_TN " %64, %65, %66, %67, %68, %69, %70, %71, " " %72, %73, %74, %75, %76, %77, %78, %79, " " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, 
%95}," - "{%96, %97, %98, %99}," - " %100," - " p, %102, %103;\n" + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), - "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), - "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; 
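// Note on the SS ("smem x smem") integer atoms above: both operands arrive as 64-bit
// GMMA shared-memory descriptors (ARegisters = BRegisters = uint64_t[1]), which is why
// the debug hook is synclog_emit_wgmma_smem_smem. The accumulator arity follows from
// the tile shape: one warpgroup of 128 threads owns the whole 64xN S32 tile, so the
// 64x256 atom carries 64*256/128 = 128 s32 values per thread (CRegisters = uint32_t[128])
// and the 64x192 atom carries 64*192/128 = 96. The setp.ne.b32 on the scale_D operand
// encodes the accumulate-versus-overwrite choice; a sketch of the semantics, read off
// the asm constraints above:
//
//   scale_D == GMMA::ScaleOut::One  -->  p true,   D = A * B + D   (accumulate)
//   scale_D == GMMA::ScaleOut::Zero -->  p false,  D = A * B       (overwrite)
//
// The ".satfinite" variants clamp the s32 accumulators at the int32 limits on
// overflow instead of wrapping.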
//////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x32_F16E4M3E4M3_SS_TN +// GMMA 64x8x32 TN S32+=S8*U8 +struct MMA_64x8x32_S32S8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[52]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %54, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - " %52," - " %53," - " p, %55, %56;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.u8 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) - : "l"(desc_a), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif 
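// Note on the RS ("reg x smem") integer atoms: only B is passed as a shared-memory
// descriptor; A lives in registers, so the hook becomes synclog_emit_wgmma_reg_smem.
// ARegisters = uint32_t[4] is each thread's 16-byte share of the 64x32 s8 A tile
// (64*32 bytes / 128 threads), and CRegisters = uint32_t[4] since 64*8/128 = 4.
// A minimal call sketch for this n8 atom (a_frag, smem_desc_b, and acc are
// hypothetical names for illustration; real kernels drive this through cute::MMA_Atom
// and must wrap it in the usual warpgroup fence/commit/wait sequencing):
//
//   uint32_t a_frag[4];    // this thread's slice of the 64x32 s8 A operand
//   uint32_t acc[4] = {};  // this thread's four s32 accumulators of the 64x8 tile
//   MMA_64x8x32_S32S8U8_RS_TN::fma(a_frag[0], a_frag[1], a_frag[2], a_frag[3],
//                                  smem_desc_b,
//                                  acc[0], acc[1], acc[2], acc[3],
//                                  GMMA::ScaleOut::One);  // accumulate into acc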
//////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x32_F16E4M3E4M3_RS_TN +// GMMA 64x8x32 TN S32+=S8*U8 +struct MMA_64x8x32_S32S8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[52]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %57, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - "{%52, %53, %54, %55}," - " %56," - " p, %58, %59;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32S8U8_RS_TN_SATURATE 
without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x32_F32E4M3E4M3_SS_TN +// GMMA 64x16x32 TN S32+=S8*U8 +struct MMA_64x16x32_S32S8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p, %107, %108;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p;\n" "}\n" - : 
"+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) - : "l"(desc_a), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x32_F32E4M3E4M3_RS_TN +// GMMA 64x16x32 TN S32+=S8*U8 +struct MMA_64x16x32_S32S8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, 
float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p, %110, %111;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x32_F16E4M3E4M3_SS_TN +// GMMA 64x32x32 TN S32+=S8*U8 +struct MMA_64x32x32_S32S8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "l"(desc_a), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x32_F16E4M3E4M3_RS_TN +// GMMA 64x32x32 TN S32+=S8*U8 +struct MMA_64x32x32_S32S8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -34708,296 +7917,156 @@ struct SM90_64x224x32_F16E4M3E4M3_RS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f16.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + 
"r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x32_F32E4M3E4M3_SS_TN +// GMMA 64x64x32 TN S32+=S8*U8 +struct MMA_64x64x32_S32S8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[112]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, 
%11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p, %115, %116;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) - : "l"(desc_a), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x32_F32E4M3E4M3_RS_TN +// GMMA 64x64x32 TN S32+=S8*U8 +struct MMA_64x64x32_S32S8U8_RS_TN_SATURATE { 
using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[112]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f32.e4m3.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p, %118, %119;\n" + 
"setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x32_F16E4M3E4M3_SS_TN +// GMMA 64x96x32 TN S32+=S8*U8 +struct MMA_64x96x32_S32S8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -35011,28 +8080,24 @@ struct SM90_64x240x32_F16E4M3E4M3_SS_TN uint32_t & d36, uint32_t & d37, 
uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %62, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - " %60," - " %61," - " p, %63, %64;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -35045,34 +8110,25 @@ struct SM90_64x240x32_F16E4M3E4M3_SS_TN "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) - : "l"(desc_a), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x32_F16E4M3E4M3_RS_TN +// GMMA 64x96x32 TN S32+=S8*U8 +struct MMA_64x96x32_S32S8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -35089,28 +8145,24 @@ struct SM90_64x240x32_F16E4M3E4M3_RS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %65, 0;\n" - 
"wgmma.mma_async.sync.aligned.m64n240k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.s8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - "{%60, %61, %62, %63}," - " %64," - " p, %66, %67;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -35123,76 +8175,54 @@ struct SM90_64x240x32_F16E4M3E4M3_RS_TN "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x32_F32E4M3E4M3_SS_TN +// GMMA 64x128x32 TN S32+=S8*U8 +struct MMA_64x128x32_S32S8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, 
- float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -35200,114 +8230,74 @@ struct SM90_64x240x32_F32E4M3E4M3_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p, %123, %124;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - 
"+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) - : "l"(desc_a), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x32_F32E4M3E4M3_RS_TN +// GMMA 64x128x32 TN S32+=S8*U8 +struct MMA_64x128x32_S32S8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, 
float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.s8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -35315,74 +8305,48 @@ struct SM90_64x240x32_F32E4M3E4M3_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p, %126, %127;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), 
"+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x256x32_F16E4M3E4M3_SS_TN +// GMMA 64x192x32 TN S32+=S8*U8 +struct MMA_64x192x32_S32S8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -35400,14 +8364,23 @@ struct SM90_64x256x32_F16E4M3E4M3_SS_TN uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + 
uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -35415,10 +8388,14 @@ struct SM90_64x256x32_F16E4M3E4M3_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p, %67, %68;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -35435,29 +8412,33 @@ struct SM90_64x256x32_F16E4M3E4M3_SS_TN "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) - : "l"(desc_a), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x32 TN F16+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x256x32_F16E4M3E4M3_RS_TN +// GMMA 64x192x32 TN S32+=S8*U8 +struct MMA_64x192x32_S32S8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[64]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -35478,14 +8459,23 @@ struct SM90_64x256x32_F16E4M3E4M3_RS_TN uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, uint32_t & d60, uint32_t & d61, 
uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.f16.e4m3.e4m3 " + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.s8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -35493,10 +8483,14 @@ struct SM90_64x256x32_F16E4M3E4M3_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p, %70, %71;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -35513,73 +8507,78 @@ struct SM90_64x256x32_F16E4M3E4M3_RS_TN "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), - "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x256x32_F32E4M3E4M3_SS_TN +// GMMA 64x256x32 TN S32+=S8*U8 +struct MMA_64x256x32_S32S8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[128]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d000, float & d001, 
float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, - float & d120, float & d121, float & d122, float & d123, - float & d124, float & d125, float & d126, float & d127, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & 
d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e4m3 " + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -35596,108 +8595,105 @@ struct SM90_64x256x32_F32E4M3E4M3_SS_TN " %104, %105, %106, %107, %108, %109, %110, %111, " " %112, %113, %114, %115, %116, %117, %118, %119, " " %120, %121, %122, %123, %124, %125, %126, %127}," - " %128," - " %129," - " p, %131, %132;\n" + "{%128, %129, %130, %131}," + " %132," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), - "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), - "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) - : "l"(desc_a), + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + 
"+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x32 TN F32+=E4M3*E4M3 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x256x32_F32E4M3E4M3_RS_TN +// GMMA 64x256x32 TN S32+=S8*U8 +struct MMA_64x256x32_S32S8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[128]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, 
float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, - float & d120, float & d121, float & d122, float & d123, - float & d124, float & d125, float & d126, float & d127, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e4m3 " + "wgmma.mma_async.sync.aligned.m64n256k32.s32.s8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -35716,221 +8712,53 @@ struct SM90_64x256x32_F32E4M3E4M3_RS_TN " %120, %121, %122, %123, %124, %125, %126, %127}," "{%128, %129, %130, %131}," " %132," - " p, %134, %135;\n" + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), 
"+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), - "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), - "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) : "r"(a000), "r"(a001), "r"(a002), "r"(a003), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x32_F16E4M3E5M2_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; 
- using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[2]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %4, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f16.e4m3.e5m2 " - "{%0, %1}," - " %2," - " %3," - " p, %5, %6;\n" - "}\n" - : "+r"(d0), "+r"(d1) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x32_F16E4M3E5M2_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[2]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, - uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %7, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f16.e4m3.e5m2 " - "{%0, %1}," - "{%2, %3, %4, %5}," - " %6," - " p, %8, %9;\n" - "}\n" - : "+r"(d0), "+r"(d1) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x32_F32E4M3E5M2_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[4]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3}," - " %4," - " %5," - " p, %7, %8;\n" - "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x8x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x8x32_F32E4M3E5M2_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[4]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t 
const& a2, uint32_t const& a3, - uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p, %10, %11;\n" - "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -// GMMA 64x16x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x16x32_F16E4M3E5M2_SS_TN + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x8x32 TN S32+=U8*S8 +struct MMA_64x8x32_S32U8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -35944,349 +8772,354 @@ struct SM90_64x16x32_F16E4M3E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.f16.e4m3.e5m2 " + "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.s8 " "{%0, %1, %2, %3}," " %4," " %5," - " p, %7, %8;\n" + " p;\n" "}\n" : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x16x32_F16E4M3E5M2_RS_TN +// GMMA 64x8x32 TN S32+=U8*S8 +struct MMA_64x8x32_S32U8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + fma(uint64_t const& desc_a, uint64_t const& desc_b, uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.f16.e4m3.e5m2 " + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.s8.satfinite " "{%0, %1, %2, %3}," - "{%4, %5, %6, %7}," - " %8," - " p, %10, %11;\n" + " %4," + " %5," + " p;\n" "}\n" : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) - : 
"r"(a0), "r"(a1), "r"(a2), "r"(a3), + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x16x32_F32E4M3E5M2_SS_TN +// GMMA 64x16x32 TN S32+=U8*S8 +struct MMA_64x16x32_S32U8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[8]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - float & d4, float & d5, float & d6, float & d7, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e5m2 " + "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7}," " %8," " %9," - " p, %11, %12;\n" + " p;\n" "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), - "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x16x32_F32E4M3E5M2_RS_TN +// GMMA 64x16x32 TN S32+=U8*S8 +struct MMA_64x16x32_S32U8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[8]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d0, float & d1, float & d2, float & d3, - float & d4, float & d5, float & d6, float & d7, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.s8.satfinite " "{%0, %1, %2, %3, %4, 
%5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p, %14, %15;\n" + " %8," + " %9," + " p;\n" "}\n" - : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), - "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x32_F16E4M3E5M2_SS_TN +// GMMA 64x32x32 TN S32+=U8*S8 +struct MMA_64x32x32_S32U8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - " %8," - " %9," - " p, %11, %12;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p;\n" "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x32_F16E4M3E5M2_RS_TN +// GMMA 64x32x32 TN S32+=U8*S8 +struct MMA_64x32x32_S32U8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[8]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, 
+ fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, - uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7}," - "{%8, %9, %10, %11}," - " %12," - " p, %14, %15;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p;\n" "}\n" - : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), - "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) - : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x32_F32E4M3E5M2_SS_TN +// GMMA 64x64x32 TN S32+=U8*S8 +struct MMA_64x64x32_S32U8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[16]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - " %16," - " %17," - " p, %19, 
%20;\n" + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x32x32_F32E4M3E5M2_RS_TN +// GMMA 64x64x32 TN S32+=U8*S8 +struct MMA_64x64x32_S32U8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[16]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.s8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p, %22, %23;\n" + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + 
"+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F16E4M3E5M2_SS_TN +// GMMA 64x96x32 TN S32+=U8*S8 +struct MMA_64x96x32_S32U8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[12]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -36294,204 +9127,279 @@ struct SM90_64x48x32_F16E4M3E5M2_SS_TN uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %14, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f16.e4m3.e5m2 " + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11}," - " %12," - " %13," - " p, %15, %16;\n" + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "l"(desc_a), "l"(desc_b), - 
"r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F16E4M3E5M2_RS_TN +// GMMA 64x96x32 TN S32+=U8*S8 +struct MMA_64x96x32_S32U8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[12]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %17, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f16.e4m3.e5m2 " + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.s8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11}," - "{%12, %13, %14, %15}," - " %16," - " p, %18, %19;\n" + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F16E4M3E5M2_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F32E4M3E5M2_SS_TN +// GMMA 64x128x32 TN S32+=U8*S8 +struct MMA_64x128x32_S32U8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[24]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p, %27, %28;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + 
"+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F32E4M3E5M2_RS_TN +// GMMA 64x128x32 TN S32+=U8*S8 +struct MMA_64x128x32_S32U8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[24]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, 
%23}," - "{%24, %25, %26, %27}," - " %28," - " p, %30, %31;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x32_F16E4M3E5M2_SS_TN +// GMMA 64x192x32 TN S32+=U8*S8 +struct MMA_64x192x32_S32U8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -36500,499 +9408,619 @@ struct SM90_64x64x32_F16E4M3E5M2_SS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, 
+ uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - " %16," - " %17," - " p, %19, %20;\n" + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x32_F16E4M3E5M2_RS_TN +// GMMA 64x192x32 TN S32+=U8*S8 +struct MMA_64x192x32_S32U8S8_SS_TN_SATURATE { 
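+  // Fragment-size arithmetic for this family of atoms: CRegisters = uint32_t[96]
+  // below is the per-thread S32 accumulator fragment. A 64x192 tile holds
+  // 64*192 = 12288 values, spread over the warpgroup's 128 threads, giving
+  // 12288/128 = 96 registers per thread. The same arithmetic yields the
+  // uint32_t[128] fragments of the 64x256 shapes further down.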
using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[16]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15}," - "{%16, %17, %18, %19}," - " %20," - " p, %22, %23;\n" + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + 
"+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x32_F32E4M3E5M2_SS_TN +// GMMA 64x256x32 TN S32+=U8*S8 +struct MMA_64x256x32_S32U8S8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[32]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, 
uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p, %35, %36;\n" + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + 
"+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x64x32_F32E4M3E5M2_RS_TN +// GMMA 64x256x32 TN S32+=U8*S8 +struct MMA_64x256x32_S32U8S8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[32]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, 
uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p, %38, %39;\n" + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + 
"+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x32_F16E4M3E5M2_SS_TN +// GMMA 64x8x32 TN S32+=U8*S8 +struct MMA_64x8x32_S32U8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[20]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %22, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19}," - " %20," - " %21," - " p, %23, %24;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.s8 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) - : "l"(desc_a), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif 
//////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x32_F16E4M3E5M2_RS_TN +// GMMA 64x8x32 TN S32+=U8*S8 +struct MMA_64x8x32_S32U8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[20]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %25, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19}," - "{%20, %21, %22, %23}," - " %24," - " p, %26, %27;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x32_F32E4M3E5M2_SS_TN +// GMMA 64x16x32 TN S32+=U8*S8 +struct MMA_64x16x32_S32U8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[40]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & 
d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p, %43, %44;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) - : "l"(desc_a), + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x80x32_F32E4M3E5M2_RS_TN +// GMMA 64x16x32 TN S32+=U8*S8 +struct MMA_64x16x32_S32U8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[40]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, 
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p, %46, %47;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x32_F16E4M3E5M2_SS_TN +// GMMA 64x32x32 TN S32+=U8*S8 +struct MMA_64x32x32_S32U8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f16.e4m3.e5m2 " + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p, %27, %28;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), 
"+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) - : "l"(desc_a), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x32_F16E4M3E5M2_RS_TN +// GMMA 64x32x32 TN S32+=U8*S8 +struct MMA_64x32x32_S32U8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -37001,191 +10029,156 @@ struct SM90_64x96x32_F16E4M3E5M2_RS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f16.e4m3.e5m2 " + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.s8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p, %30, %31;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x32_F32E4M3E5M2_SS_TN +// GMMA 64x64x32 TN S32+=U8*S8 +struct MMA_64x64x32_S32U8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - 
using CRegisters = float[48]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p, %51, %52;\n" + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) - : "l"(desc_a), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; 
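Note on the new integer atoms above and below: every `m64nNk32` s32 WGMMA tile distributes its 64*N 32-bit accumulators evenly across the 128 threads of the warpgroup, so `CRegisters` is always `uint32_t[N/2]` (N=8 gives 4, N=32 gives 16, up to N=256 giving 128), and the predicate operand in each `setp.ne.b32` is simply the last asm operand, `scale_D`. The calling convention is otherwise unchanged from the floating-point GMMA atoms these replace: RS variants take four 32-bit registers of A, a 64-bit shared-memory descriptor for B, and the accumulators by reference. A minimal device-side sketch follows; it assumes this hunk belongs to `cute/arch/mma_sm90_gmma.hpp` and that the structs live in the `cute::SM90::GMMA` namespace, and it elides the `warpgroup_arrive`/commit/wait synchronization that a real mainloop performs.

// Sketch only, not part of this patch: the register convention of the new
// MMA_64x32x32_S32U8S8_RS_TN atom. Four u32 of A carry 32 u8 values, desc_b
// is a GMMA shared-memory descriptor for B (assumed already built), and the
// 16 s32 accumulators are the N/2 = 32/2 per-thread outputs. All 128 threads
// of the warpgroup must execute this together, on an SM90A-enabled build.
#include <cute/arch/mma_sm90_gmma.hpp>

__device__ void s32_u8s8_tile_sketch(uint32_t const (&a)[4],  // A fragment (32 u8 values)
                                     uint64_t desc_b,         // B shared-memory descriptor
                                     uint32_t (&d)[16])       // s32 accumulators
{
  cute::SM90::GMMA::MMA_64x32x32_S32U8S8_RS_TN::fma(
      a[0], a[1], a[2], a[3],
      desc_b,
      d[0],  d[1],  d[2],  d[3],  d[4],  d[5],  d[6],  d[7],
      d[8],  d[9],  d[10], d[11], d[12], d[13], d[14], d[15],
      cute::GMMA::ScaleOut::One);  // One accumulates into d; Zero overwrites it
}

The `_SATURATE` variants differ only in emitting the `.satfinite` form of the instruction, which clamps the s32 result at the type bounds instead of wrapping on overflow.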
//////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x96x32_F32E4M3E5M2_RS_TN +// GMMA 64x64x32 TN S32+=U8*S8 +struct MMA_64x64x32_S32U8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[48]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p, %54, %55;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), 
"+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x32_F16E4M3E5M2_SS_TN +// GMMA 64x96x32 TN S32+=U8*S8 +struct MMA_64x96x32_S32U8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -37194,21 +10187,29 @@ struct SM90_64x112x32_F16E4M3E5M2_SS_TN uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %30, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - " %28," - " %29," - " p, %31, %32;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -37216,31 +10217,30 @@ struct SM90_64x112x32_F16E4M3E5M2_SS_TN "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) - : "l"(desc_a), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), 
"r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x32_F16E4M3E5M2_RS_TN +// GMMA 64x96x32 TN S32+=U8*S8 +struct MMA_64x96x32_S32U8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -37252,21 +10252,29 @@ struct SM90_64x112x32_F16E4M3E5M2_RS_TN uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %33, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - "{%28, %29, %30, %31}," - " %32," - " p, %34, %35;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -37274,183 +10282,183 @@ struct SM90_64x112x32_F16E4M3E5M2_RS_TN "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32U8S8_RS_TN_SATURATE without 
CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x32_F32E4M3E5M2_SS_TN +// GMMA 64x128x32 TN S32+=U8*S8 +struct MMA_64x128x32_S32U8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), 
"+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) - : "l"(desc_a), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x112x32_F32E4M3E5M2_RS_TN +// GMMA 64x128x32 TN S32+=U8*S8 +struct MMA_64x128x32_S32U8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + 
uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.s8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F16+=E4M3*E5M2 -template < 
- GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x32_F16E4M3E5M2_SS_TN +// GMMA 64x192x32 TN S32+=U8*S8 +struct MMA_64x192x32_S32U8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -37460,21 +10468,46 @@ struct SM90_64x128x32_F16E4M3E5M2_SS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p, %35, %36;\n" + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -37483,29 +10516,41 @@ struct SM90_64x128x32_F16E4M3E5M2_SS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) - : "l"(desc_a), + 
"+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x32_F16E4M3E5M2_RS_TN +// GMMA 64x192x32 TN S32+=U8*S8 +struct MMA_64x192x32_S32U8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -37518,21 +10563,46 @@ struct SM90_64x128x32_F16E4M3E5M2_RS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " 
%24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p, %38, %39;\n" + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -37541,57 +10611,86 @@ struct SM90_64x128x32_F16E4M3E5M2_RS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x32_F32E4M3E5M2_SS_TN +// GMMA 64x256x32 TN S32+=U8*S8 +struct MMA_64x256x32_S32U8S8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[64]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, 
float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.s8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -37599,77 +10698,114 @@ struct SM90_64x128x32_F32E4M3E5M2_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p, %67, %68;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " 
%88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) - : "l"(desc_a), + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x128x32_F32E4M3E5M2_RS_TN +// GMMA 64x256x32 TN S32+=U8*S8 +struct MMA_64x256x32_S32U8S8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = 
float[64]; + using CRegisters = uint32_t[128]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm 
volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.s8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -37677,346 +10813,225 @@ struct SM90_64x128x32_F32E4M3E5M2_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p, %70, %71;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), 
"r"(a002), "r"(a003), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x144x32_F16E4M3E5M2_SS_TN +// GMMA 64x8x32 TN S32+=U8*U8 +struct MMA_64x8x32_S32U8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %38, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35}," - " %36," - " %37," - " p, %39, %40;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x144x32_F16E4M3E5M2_RS_TN +// GMMA 64x8x32 TN 
S32+=U8*U8 +struct MMA_64x8x32_S32U8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %41, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35}," - "{%36, %37, %38, %39}," - " %40," - " p, %42, %43;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x144x32_F32E4M3E5M2_SS_TN +// GMMA 64x16x32 TN S32+=U8*U8 +struct MMA_64x16x32_S32U8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[72]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, 
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x32 TN F32+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x144x32_F32E4M3E5M2_SS_TN
+// GMMA 64x16x32 TN S32+=U8*U8
+struct MMA_64x16x32_S32U8U8_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
   using BRegisters = uint64_t[1];
-  using CRegisters = float[72];
+  using CRegisters = uint32_t[8];

   CUTE_HOST_DEVICE static void
   fma(uint64_t const& desc_a,
       uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
+      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
+      uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %74, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k32.f32.e4m3.e5m2 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      " %72,"
-      " %73,"
-      " p, %75, %76;\n"
+      "setp.ne.b32 p, %10, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7},"
+      " %8,"
+      " %9,"
+      " p;\n"
     "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71)
+      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
+        "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7)
       : "l"(desc_a),
         "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x144x32 TN F32+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x144x32_F32E4M3E5M2_RS_TN
+// GMMA 64x16x32 TN S32+=U8*U8
+struct MMA_64x16x32_S32U8U8_SS_TN_SATURATE
 {
   using DRegisters = void;
-  using ARegisters = uint32_t[4];
+  using ARegisters = uint64_t[1];
   using BRegisters = uint64_t[1];
-  using CRegisters = float[72];
+  using CRegisters = uint32_t[8];

   CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+  fma(uint64_t const& desc_a,
       uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
+      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
+      uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %77, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n144k32.f32.e4m3.e5m2 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71},"
-      "{%72, %73, %74, %75},"
-      " %76,"
-      " p, %78, %79;\n"
+      "setp.ne.b32 p, %10, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.u8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7},"
+      " %8,"
+      " %9,"
+      " p;\n"
     "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
+        "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7)
+      : "l"(desc_a),
        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x32 TN F16+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x160x32_F16E4M3E5M2_SS_TN
+// GMMA 64x32x32 TN S32+=U8*U8
+struct MMA_64x32x32_S32U8U8_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
   using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
+  using CRegisters = uint32_t[16];

   CUTE_HOST_DEVICE static void
   fma(uint64_t const& desc_a,
@@ -38025,308 +11040,198 @@ struct SM90_64x160x32_F16E4M3E5M2_SS_TN
       uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
       uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
       uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %42, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k32.f16.e4m3.e5m2 "
+      "setp.ne.b32 p, %18, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.u8 "
       "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      " %40,"
-      " %41,"
-      " p, %43, %44;\n"
+      " %8, %9, %10, %11, %12, %13, %14, %15},"
+      " %16,"
+      " %17,"
+      " p;\n"
     "}\n"
       : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
         "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
         "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
       : "l"(desc_a),
         "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////
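[Editorial note, not part of the patch.] The CRegisters sizes of the new atoms follow directly from the tile shape: a warpgroup has 128 threads and the S32 accumulator holds M*N values, so each thread carries M*N/128 32-bit registers (N/2 for M = 64) — uint32_t[4] at N=8 up to uint32_t[128] at N=256. A compile-time sanity check, as a sketch (namespace qualification elided):

static_assert(sizeof(MMA_64x32x32_S32U8U8_SS_TN::CRegisters) / sizeof(uint32_t)
                  == (64 * 32) / 128,
              "each thread holds M*N/128 s32 accumulator registers");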
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x32 TN F16+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x160x32_F16E4M3E5M2_RS_TN
+// GMMA 64x32x32 TN S32+=U8*U8
+struct MMA_64x32x32_S32U8U8_SS_TN_SATURATE
 {
   using DRegisters = void;
-  using ARegisters = uint32_t[4];
+  using ARegisters = uint64_t[1];
   using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[40];
+  using CRegisters = uint32_t[16];

   CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+  fma(uint64_t const& desc_a,
       uint64_t const& desc_b,
       uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
       uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
       uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
       uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
-      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
-      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
-      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
-      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
-      "setp.ne.b32 p, %45, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k32.f16.e4m3.e5m2 "
+      "setp.ne.b32 p, %18, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.u8.satfinite "
       "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      "{%40, %41, %42, %43},"
-      " %44,"
-      " p, %46, %47;\n"
+      " %8, %9, %10, %11, %12, %13, %14, %15},"
+      " %16,"
+      " %17,"
+      " p;\n"
     "}\n"
       : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
-        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
-        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
-        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
-        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
-        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
+      : "l"(desc_a),
        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x160x32 TN F32+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x160x32_F32E4M3E5M2_SS_TN
+// GMMA 64x64x32 TN S32+=U8*U8
+struct MMA_64x64x32_S32U8U8_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
   using BRegisters = uint64_t[1];
-  using CRegisters = float[80];
+  using CRegisters = uint32_t[32];

   CUTE_HOST_DEVICE static void
   fma(uint64_t const& desc_a,
       uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float & d79,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %82, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n160k32.f32.e4m3.e5m2 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47, "
-      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79},"
-      " %80,"
-      " %81,"
-      " p, %83, %84;\n"
+      "setp.ne.b32 p, %34, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31},"
+      " %32,"
+      " %33,"
+      " p;\n"
     "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79)
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31)
       : "l"(desc_a),
         "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////
- "{%80, %81, %82, %83}," - " %84," - " p, %86, %87;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x176x32_F16E4M3E5M2_SS_TN +// GMMA 64x96x32 TN S32+=U8*U8 +struct MMA_64x96x32_S32U8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -38342,23 +11247,25 @@ struct SM90_64x176x32_F16E4M3E5M2_SS_TN uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %46, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f16.e4m3.e5m2 " + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, 
%28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43}," - " %44," - " %45," - " p, %47, %48;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -38370,34 +11277,29 @@ struct SM90_64x176x32_F16E4M3E5M2_SS_TN "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x176x32_F16E4M3E5M2_RS_TN +// GMMA 64x96x32 TN S32+=U8*U8 +struct MMA_64x96x32_S32U8U8_SS_TN_SATURATE { using DRegisters = void; - using ARegisters = uint32_t[4]; + using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint64_t const& desc_a, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -38410,23 +11312,25 @@ struct SM90_64x176x32_F16E4M3E5M2_RS_TN uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %49, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f16.e4m3.e5m2 " + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43}," - "{%44, %45, %46, %47}," + " %40, %41, %42, %43, %44, %45, %46, %47}," " %48," - " p, %50, %51;\n" + " %49," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -38438,65 +11342,55 @@ struct SM90_64x176x32_F16E4M3E5M2_RS_TN "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x176x32_F32E4M3E5M2_SS_TN +// GMMA 64x128x32 TN S32+=U8*U8 +struct MMA_64x128x32_S32U8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[88]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %90, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %66, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -38504,94 +11398,74 @@ struct SM90_64x176x32_F32E4M3E5M2_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - " %88," - " %89," - " p, %91, %92;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x176x32_F32E4M3E5M2_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[88]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, 
float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x128x32 TN S32+=U8*U8 +struct MMA_64x128x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -38599,59 +11473,45 @@ struct SM90_64x176x32_F32E4M3E5M2_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p, %94, %95;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p;\n" 
"}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x192x32_F16E4M3E5M2_SS_TN +// GMMA 64x192x32 TN S32+=U8*U8 +struct MMA_64x192x32_S32U8U8_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; + using CRegisters = uint32_t[96]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -38668,23 +11528,42 @@ struct SM90_64x192x32_F16E4M3E5M2_SS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + 
-// GMMA 64x192x32 TN F16+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x192x32_F16E4M3E5M2_SS_TN
+// GMMA 64x192x32 TN S32+=U8*U8
+struct MMA_64x192x32_S32U8U8_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
   using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
+  using CRegisters = uint32_t[96];

   CUTE_HOST_DEVICE static void
   fma(uint64_t const& desc_a,
@@ -38668,23 +11528,42 @@ struct SM90_64x192x32_F16E4M3E5M2_SS_TN
       uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
       uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
       uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91,
+      uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %50, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n192k32.f16.e4m3.e5m2 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      " %48,"
-      " %49,"
-      " p, %51, %52;\n"
+      "setp.ne.b32 p, %98, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95},"
+      " %96,"
+      " %97,"
+      " p;\n"
     "}\n"
       : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
@@ -38697,32 +11576,40 @@ struct SM90_64x192x32_F16E4M3E5M2_SS_TN
        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
+        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
+        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
       : "l"(desc_a),
         "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x192x32 TN F16+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x192x32_F16E4M3E5M2_RS_TN
+// GMMA 64x192x32 TN S32+=U8*U8
+struct MMA_64x192x32_S32U8U8_SS_TN_SATURATE
 {
   using DRegisters = void;
-  using ARegisters = uint32_t[4];
+  using ARegisters = uint64_t[1];
   using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[48];
+  using CRegisters = uint32_t[96];

   CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+  fma(uint64_t const& desc_a,
       uint64_t const& desc_b,
       uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
       uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
@@ -38736,23 +11623,42 @@ struct SM90_64x192x32_F16E4M3E5M2_RS_TN
       uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
       uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
       uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91,
+      uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %53, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n192k32.f16.e4m3.e5m2 "
+      "setp.ne.b32 p, %98, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.u8.satfinite "
       "{%0, %1, %2, %3, %4, %5, %6, %7, "
       " %8, %9, %10, %11, %12, %13, %14, %15, "
       " %16, %17, %18, %19, %20, %21, %22, %23, "
       " %24, %25, %26, %27, %28, %29, %30, %31, "
       " %32, %33, %34, %35, %36, %37, %38, %39, "
-      " %40, %41, %42, %43, %44, %45, %46, %47},"
-      "{%48, %49, %50, %51},"
-      " %52,"
-      " p, %54, %55;\n"
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95},"
+      " %96,"
+      " %97,"
+      " p;\n"
     "}\n"
       : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
@@ -38765,65 +11671,82 @@ struct SM90_64x192x32_F16E4M3E5M2_RS_TN
        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
-        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
+        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
+        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
+      : "l"(desc_a),
        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x192x32 TN F32+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x192x32_F32E4M3E5M2_SS_TN
+// GMMA 64x256x32 TN S32+=U8*U8
+struct MMA_64x256x32_S32U8U8_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
   using BRegisters = uint64_t[1];
-  using CRegisters = float[96];
+  using CRegisters = uint32_t[128];

   CUTE_HOST_DEVICE static void
   fma(uint64_t const& desc_a,
       uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      float & d40, float & d41, float & d42, float & d43,
-      float & d44, float & d45, float & d46, float & d47,
-      float & d48, float & d49, float & d50, float & d51,
-      float & d52, float & d53, float & d54, float & d55,
-      float & d56, float & d57, float & d58, float & d59,
-      float & d60, float & d61, float & d62, float & d63,
-      float & d64, float & d65, float & d66, float & d67,
-      float & d68, float & d69, float & d70, float & d71,
-      float & d72, float & d73, float & d74, float & d75,
-      float & d76, float & d77, float & d78, float & d79,
-      float & d80, float & d81, float & d82, float & d83,
-      float & d84, float & d85, float & d86, float & d87,
-      float & d88, float & d89, float & d90, float & d91,
-      float & d92, float & d93, float & d94, float & d95,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123,
+      uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %98, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e5m2 "
+      "setp.ne.b32 p, %130, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.u8 "
       "{%0, %1, %2, %3, %4, %5, %6, %7, "
       " %8, %9, %10, %11, %12, %13, %14, %15, "
       " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -38835,93 +11758,110 @@ struct SM90_64x192x32_F32E4M3E5M2_SS_TN
       " %64, %65, %66, %67, %68, %69, %70, %71, "
       " %72, %73, %74, %75, %76, %77, %78, %79, "
       " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95},"
-      " %96,"
-      " %97,"
-      " p, %99, %100;\n"
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111, "
+      " %112, %113, %114, %115, %116, %117, %118, %119, "
+      " %120, %121, %122, %123, %124, %125, %126, %127},"
+      " %128,"
+      " %129,"
+      " p;\n"
     "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
-        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
-        "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87),
-        "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91),
-        "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95)
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119),
+        "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123),
+        "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127)
      : "l"(desc_a),
        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x192x32 TN F32+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x192x32_F32E4M3E5M2_RS_TN
+// GMMA 64x256x32 TN S32+=U8*U8
+struct MMA_64x256x32_S32U8U8_SS_TN_SATURATE
 {
   using DRegisters = void;
-  using ARegisters = uint32_t[4];
+  using ARegisters = uint64_t[1];
   using BRegisters = uint64_t[1];
-  using CRegisters = float[96];
+  using CRegisters = uint32_t[128];

   CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+  fma(uint64_t const& desc_a,
       uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123,
+      uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
-      "setp.ne.b32 p, %101, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e5m2 "
+      "setp.ne.b32 p, %130, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.u8.satfinite "
       "{%0, %1, %2, %3, %4, %5, %6, %7, "
       " %8, %9, %10, %11, %12, %13, %14, %15, "
       " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -38933,489 +11873,266 @@ struct SM90_64x192x32_F32E4M3E5M2_RS_TN
       " %32, %33, %34, %35, %36, %37, %38, %39, "
       " %40, %41, %42, %43, %44, %45, %46, %47, "
       " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95},"
-      "{%96, %97, %98, %99},"
-      " %100,"
-      " p, %102, %103;\n"
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111, "
+      " %112, %113, %114, %115, %116, %117, %118, %119, "
+      " %120, %121, %122, %123, %124, %125, %126, %127},"
+      " %128,"
+      " %129,"
+      " p;\n"
     "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
-        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
-        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
-        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
-        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
-        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
-        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
-        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
-        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
-        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
-        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
-        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
-        "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87),
-        "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91),
-        "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119),
+        "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123),
+        "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127)
+      : "l"(desc_a),
+        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////
uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, +// GMMA 64x8x32 TN S32+=U8*U8 +struct MMA_64x8x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %54, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - " %52," - " %53," - " p, %55, %56;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) - : "l"(desc_a), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x32_F16E4M3E5M2_RS_TN +// GMMA 64x8x32 TN S32+=U8*U8 +struct MMA_64x8x32_S32U8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[52]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& 
desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %57, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - "{%52, %53, %54, %55}," - " %56," - " p, %58, %59;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x32_F32E4M3E5M2_SS_TN +// GMMA 64x16x32 TN S32+=U8*U8 +struct MMA_64x16x32_S32U8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t 
const& a3, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p, %107, %108;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), 
"+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) - : "l"(desc_a), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x208x32_F32E4M3E5M2_RS_TN +// GMMA 64x16x32 TN S32+=U8*U8 +struct MMA_64x16x32_S32U8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p, %110, %111;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x32_F16E4M3E5M2_SS_TN +// GMMA 64x32x32 TN S32+=U8*U8 +struct MMA_64x32x32_S32U8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& 
a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) - : "l"(desc_a), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x32_F16E4M3E5M2_RS_TN +// GMMA 64x32x32 TN S32+=U8*U8 +struct MMA_64x32x32_S32U8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = uint32_t[16]; 
CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -39424,296 +12141,156 @@ struct SM90_64x224x32_F16E4M3E5M2_RS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f16.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x32_F32E4M3E5M2_SS_TN +// GMMA 64x64x32 TN S32+=U8*U8 +struct MMA_64x64x32_S32U8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - 
using CRegisters = float[112]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p, %115, %116;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, 
%19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) - : "l"(desc_a), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x224x32_F32E4M3E5M2_RS_TN +// GMMA 64x64x32 TN S32+=U8*U8 +struct MMA_64x64x32_S32U8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[112]; + using CRegisters = uint32_t[32]; - CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, - uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, 
float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f32.e4m3.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p, %118, %119;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), 
"+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x32_F16E4M3E5M2_SS_TN +// GMMA 64x96x32 TN S32+=U8*U8 +struct MMA_64x96x32_S32U8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, @@ -39727,28 +12304,24 @@ struct SM90_64x240x32_F16E4M3E5M2_SS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %62, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f16.e4m3.e5m2 " + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.u8 " "{%0, %1, %2, %3, %4, %5, %6, 
%7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - " %60," - " %61," - " p, %63, %64;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -39761,34 +12334,25 @@ struct SM90_64x240x32_F16E4M3E5M2_SS_TN "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) - : "l"(desc_a), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F16+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x32_F16E4M3E5M2_RS_TN +// GMMA 64x96x32 TN S32+=U8*U8 +struct MMA_64x96x32_S32U8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -39805,28 +12369,24 @@ struct SM90_64x240x32_F16E4M3E5M2_RS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %65, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f16.e4m3.e5m2 " + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.s32.u8.u8.satfinite " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - "{%60, %61, %62, %63}," - " %64," - " p, %66, %67;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -39839,76 +12399,54 @@ struct SM90_64x240x32_F16E4M3E5M2_RS_TN "+r"(d32), 
"+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x32_F32E4M3E5M2_SS_TN +// GMMA 64x128x32 TN S32+=U8*U8 +struct MMA_64x128x32_S32U8U8_RS_TN { using DRegisters = void; - using ARegisters = uint64_t[1]; + using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t 
& d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f32.e4m3.e5m2 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.u8 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -39916,114 +12454,74 @@ struct SM90_64x240x32_F32E4M3E5M2_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p, %123, %124;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) - : "l"(desc_a), + : "+r"(d00), "+r"(d01), 
"+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); + "r"(int32_t(scale_D))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F32+=E4M3*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x240x32_F32E4M3E5M2_RS_TN +// GMMA 64x128x32 TN S32+=U8*U8 +struct MMA_64x128x32_S32U8U8_RS_TN_SATURATE { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & 
d111,
-      float & d112, float & d113, float & d114, float & d115,
-      float & d116, float & d117, float & d118, float & d119,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %125, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n240k32.f32.e4m3.e5m2 "
+      "setp.ne.b32 p, %69, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n128k32.s32.u8.u8.satfinite "
       "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -40031,74 +12529,48 @@ struct SM90_64x240x32_F32E4M3E5M2_RS_TN
      " %32, %33, %34, %35, %36, %37, %38, %39, "
      " %40, %41, %42, %43, %44, %45, %46, %47, "
      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63, "
-      " %64, %65, %66, %67, %68, %69, %70, %71, "
-      " %72, %73, %74, %75, %76, %77, %78, %79, "
-      " %80, %81, %82, %83, %84, %85, %86, %87, "
-      " %88, %89, %90, %91, %92, %93, %94, %95, "
-      " %96, %97, %98, %99, %100, %101, %102, %103, "
-      " %104, %105, %106, %107, %108, %109, %110, %111, "
-      " %112, %113, %114, %115, %116, %117, %118, %119},"
-      "{%120, %121, %122, %123},"
-      " %124,"
-      " p, %126, %127;\n"
+      " %56, %57, %58, %59, %60, %61, %62, %63},"
+      "{%64, %65, %66, %67},"
+      " %68,"
+      " p;\n"
     "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
-        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
-        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
-        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
-        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
-        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
-        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
-        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
-        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
-        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
-        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
-        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
-        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
-        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
-        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
-        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
-        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
-        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
-        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
-        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
-        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
-        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
-        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
-        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107),
-        "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111),
-        "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115),
-        "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119)
-      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x256x32 TN F16+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x256x32_F16E4M3E5M2_SS_TN
+// GMMA 64x192x32 TN S32+=U8*U8
+struct MMA_64x192x32_S32U8U8_RS_TN
 {
   using DRegisters = void;
-  using ARegisters = uint64_t[1];
+  using ARegisters = uint32_t[4];
   using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[64];
+  using CRegisters = uint32_t[96];

   CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
       uint64_t const& desc_b,
       uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
       uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
@@ -40116,14 +12588,23 @@ struct SM90_64x256x32_F16E4M3E5M2_SS_TN
       uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
       uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
       uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91,
+      uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %66, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n256k32.f16.e4m3.e5m2 "
+      "setp.ne.b32 p, %101, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.u8 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -40131,10 +12612,14 @@ struct SM90_64x256x32_F16E4M3E5M2_SS_TN
      " %32, %33, %34, %35, %36, %37, %38, %39, "
      " %40, %41, %42, %43, %44, %45, %46, %47, "
      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      " %64,"
-      " %65,"
-      " p, %67, %68;\n"
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95},"
+      "{%96, %97, %98, %99},"
+      " %100,"
+      " p;\n"
     "}\n"
      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
@@ -40151,29 +12636,33 @@ struct SM90_64x256x32_F16E4M3E5M2_SS_TN
        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63)
-      : "l"(desc_a),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
+        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
+        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x256x32 TN F16+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x256x32_F16E4M3E5M2_RS_TN
+// GMMA 64x192x32 TN S32+=U8*U8
+struct MMA_64x192x32_S32U8U8_RS_TN_SATURATE
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
   using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[64];
+  using CRegisters = uint32_t[96];

   CUTE_HOST_DEVICE static void
   fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
@@ -40194,14 +12683,23 @@ struct SM90_64x256x32_F16E4M3E5M2_RS_TN
       uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
       uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
       uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91,
+      uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %69, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n256k32.f16.e4m3.e5m2 "
+      "setp.ne.b32 p, %101, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n192k32.s32.u8.u8.satfinite "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -40209,10 +12707,14 @@ struct SM90_64x256x32_F16E4M3E5M2_RS_TN
      " %32, %33, %34, %35, %36, %37, %38, %39, "
      " %40, %41, %42, %43, %44, %45, %46, %47, "
      " %48, %49, %50, %51, %52, %53, %54, %55, "
-      " %56, %57, %58, %59, %60, %61, %62, %63},"
-      "{%64, %65, %66, %67},"
-      " %68,"
-      " p, %70, %71;\n"
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95},"
+      "{%96, %97, %98, %99},"
+      " %100,"
+      " p;\n"
     "}\n"
      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
@@ -40229,73 +12731,78 @@ struct SM90_64x256x32_F16E4M3E5M2_RS_TN
        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
-        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-// GMMA 64x256x32 TN F32+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x256x32_F32E4M3E5M2_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[128];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d000, float & d001, float & d002, float & d003,
-      float & d004, float & d005, float & d006, float & d007,
-      float & d008, float & d009, float & d010, float & d011,
-      float & d012, float & d013, float & d014, float & d015,
-      float & d016, float & d017, float & d018, float & d019,
-      float & d020, float & d021, float & d022, float & d023,
-      float & d024, float & d025, float & d026, float & d027,
-      float & d028, float & d029, float & d030, float & d031,
-      float & d032, float & d033, float & d034, float & d035,
-      float & d036, float & d037, float & d038, float & d039,
-      float & d040, float & d041, float & d042, float & d043,
-      float & d044, float & d045, float & d046, float & d047,
-      float & d048, float & d049, float & d050, float & d051,
-      float & d052, float & d053, float & d054, float & d055,
-      float & d056, float & d057, float & d058, float & d059,
-      float & d060, float & d061, float & d062, float & d063,
-      float & d064, float & d065, float & d066, float & d067,
-      float & d068, float & d069, float & d070, float & d071,
-      float & d072, float & d073, float & d074, float & d075,
-      float & d076, float & d077, float & d078, float & d079,
-      float & d080, float & d081, float & d082, float & d083,
-      float & d084, float & d085, float & d086, float & d087,
-      float & d088, float & d089, float & d090, float & d091,
-      float & d092, float & d093, float & d094, float & d095,
-      float & d096, float & d097, float & d098, float & d099,
-      float & d100, float & d101, float & d102, float & d103,
-      float & d104, float & d105, float & d106, float & d107,
-      float & d108, float & d109, float & d110, float & d111,
-      float & d112, float & d113, float & d114, float & d115,
-      float & d116, float & d117, float & d118, float & d119,
-      float & d120, float & d121, float & d122, float & d123,
-      float & d124, float & d125, float & d126, float & d127,
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
+        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
+        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x256x32 TN S32+=U8*U8
+struct MMA_64x256x32_S32U8U8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[128];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123,
+      uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
-      "setp.ne.b32 p, %130, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e5m2 "
+      "setp.ne.b32 p, %133, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.u8 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -40312,108 +12819,105 @@ struct SM90_64x256x32_F32E4M3E5M2_SS_TN
      " %104, %105, %106, %107, %108, %109, %110, %111, "
      " %112, %113, %114, %115, %116, %117, %118, %119, "
      " %120, %121, %122, %123, %124, %125, %126, %127},"
-      " %128,"
-      " %129,"
-      " p, %131, %132;\n"
+      "{%128, %129, %130, %131},"
+      " %132,"
+      " p;\n"
     "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
-        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
-        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
-        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
-        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
-        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
-        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
-        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
-        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
-        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
-        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
-        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
-        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
-        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
-        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
-        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
-        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
-        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
-        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
-        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
-        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
-        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
-        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
-        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107),
-        "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111),
-        "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115),
-        "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119),
-        "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123),
-        "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127)
-      : "l"(desc_a),
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119),
+        "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123),
+        "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x256x32 TN F32+=E4M3*E5M2
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x256x32_F32E4M3E5M2_RS_TN
+// GMMA 64x256x32 TN S32+=U8*U8
+struct MMA_64x256x32_S32U8U8_RS_TN_SATURATE
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
   using BRegisters = uint64_t[1];
-  using CRegisters = float[128];
+  using CRegisters = uint32_t[128];

   CUTE_HOST_DEVICE static void
   fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
       uint64_t const& desc_b,
-      float & d000, float & d001, float & d002, float & d003,
-      float & d004, float & d005, float & d006, float & d007,
-      float & d008, float & d009, float & d010, float & d011,
-      float & d012, float & d013, float & d014, float & d015,
-      float & d016, float & d017, float & d018, float & d019,
-      float & d020, float & d021, float & d022, float & d023,
-      float & d024, float & d025, float & d026, float & d027,
-      float & d028, float & d029, float & d030, float & d031,
-      float & d032, float & d033, float & d034, float & d035,
-      float & d036, float & d037, float & d038, float & d039,
-      float & d040, float & d041, float & d042, float & d043,
-      float & d044, float & d045, float & d046, float & d047,
-      float & d048, float & d049, float & d050, float & d051,
-      float & d052, float & d053, float & d054, float & d055,
-      float & d056, float & d057, float & d058, float & d059,
-      float & d060, float & d061, float & d062, float & d063,
-      float & d064, float & d065, float & d066, float & d067,
-      float & d068, float & d069, float & d070, float & d071,
-      float & d072, float & d073, float & d074, float & d075,
-      float & d076, float & d077, float & d078, float & d079,
-      float & d080, float & d081, float & d082, float & d083,
-      float & d084, float & d085, float & d086, float & d087,
-      float & d088, float & d089, float & d090, float & d091,
-      float & d092, float & d093, float & d094, float & d095,
-      float & d096, float & d097, float & d098, float & d099,
-      float & d100, float & d101, float & d102, float & d103,
-      float & d104, float & d105, float & d106, float & d107,
-      float & d108, float & d109, float & d110, float & d111,
-      float & d112, float & d113, float & d114, float & d115,
-      float & d116, float & d117, float & d118, float & d119,
-      float & d120, float & d121, float & d122, float & d123,
-      float & d124, float & d125, float & d126, float & d127,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123,
+      uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127,
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
       ".reg .pred p;\n"
      "setp.ne.b32 p, %133, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e5m2 "
+      "wgmma.mma_async.sync.aligned.m64n256k32.s32.u8.u8.satfinite "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -40432,57 +12936,57 @@ struct SM90_64x256x32_F32E4M3E5M2_RS_TN
      " %120, %121, %122, %123, %124, %125, %126, %127},"
      "{%128, %129, %130, %131},"
      " %132,"
-      " p, %134, %135;\n"
+      " p;\n"
     "}\n"
-      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
-        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
-        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
-        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
-        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
-        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
-        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
-        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
-        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
-        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
-        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
-        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
-        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
-        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
-        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
-        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
-        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
-        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
-        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
-        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
-        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
-        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
-        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
-        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
-        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
-        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
-        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107),
-        "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111),
-        "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115),
-        "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119),
-        "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123),
-        "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127)
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119),
+        "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123),
+        "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127)
      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+        "r"(int32_t(scale_D)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x8x32 TN F16+=E5M2*E4M3
+// GMMA 64x8x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x8x32_F16E5M2E4M3_SS_TN
+struct MMA_64x8x32_F16E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -40496,11 +13000,12 @@ struct SM90_64x8x32_F16E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %4, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n8k32.f16.e4m3.e4m3 "
      "{%0, %1},"
      " %2,"
      " %3,"
@@ -40511,19 +13016,19 @@ struct SM90_64x8x32_F16E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x8x32 TN F16+=E5M2*E4M3
+// GMMA 64x8x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x8x32_F16E5M2E4M3_RS_TN
+struct MMA_64x8x32_F16E4M3E4M3_RS_TN
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
@@ -40537,11 +13042,12 @@ struct SM90_64x8x32_F16E5M2E4M3_RS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %7, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n8k32.f16.e4m3.e4m3 "
      "{%0, %1},"
      "{%2, %3, %4, %5},"
      " %6,"
@@ -40552,19 +13058,19 @@ struct SM90_64x8x32_F16E5M2E4M3_RS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x8x32 TN F32+=E5M2*E4M3
+// GMMA 64x8x32 TN F32+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x8x32_F32E5M2E4M3_SS_TN
+struct MMA_64x8x32_F32E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -40578,11 +13084,12 @@ struct SM90_64x8x32_F32E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %6, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.f32.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e4m3 "
      "{%0, %1, %2, %3},"
      " %4,"
      " %5,"
@@ -40593,19 +13100,19 @@ struct SM90_64x8x32_F32E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x8x32 TN F32+=E5M2*E4M3
+// GMMA 64x8x32 TN F32+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x8x32_F32E5M2E4M3_RS_TN
+struct MMA_64x8x32_F32E4M3E4M3_RS_TN
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
@@ -40619,11 +13126,12 @@ struct SM90_64x8x32_F32E5M2E4M3_RS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %9, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n8k32.f32.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e4m3 "
      "{%0, %1, %2, %3},"
      "{%4, %5, %6, %7},"
      " %8,"
@@ -40634,19 +13142,19 @@ struct SM90_64x8x32_F32E5M2E4M3_RS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x16x32 TN F16+=E5M2*E4M3
+// GMMA 64x16x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x16x32_F16E5M2E4M3_SS_TN
+struct MMA_64x16x32_F16E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -40660,11 +13168,12 @@ struct SM90_64x16x32_F16E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %6, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n16k32.f16.e4m3.e4m3 "
      "{%0, %1, %2, %3},"
      " %4,"
      " %5,"
@@ -40675,19 +13184,19 @@ struct SM90_64x16x32_F16E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x16x32 TN F16+=E5M2*E4M3
+// GMMA 64x16x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x16x32_F16E5M2E4M3_RS_TN
+struct MMA_64x16x32_F16E4M3E4M3_RS_TN
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
@@ -40701,11 +13210,12 @@ struct SM90_64x16x32_F16E5M2E4M3_RS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %9, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n16k32.f16.e4m3.e4m3 "
      "{%0, %1, %2, %3},"
      "{%4, %5, %6, %7},"
      " %8,"
@@ -40716,19 +13226,19 @@ struct SM90_64x16x32_F16E5M2E4M3_RS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x16x32 TN F32+=E5M2*E4M3
+// GMMA 64x16x32 TN F32+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x16x32_F32E5M2E4M3_SS_TN
+struct MMA_64x16x32_F32E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -40743,11 +13253,12 @@ struct SM90_64x16x32_F32E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %10, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.f32.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7},"
      " %8,"
      " %9,"
@@ -40759,19 +13270,19 @@ struct SM90_64x16x32_F32E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x16x32 TN F32+=E5M2*E4M3
+// GMMA 64x16x32 TN F32+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x16x32_F32E5M2E4M3_RS_TN
+struct MMA_64x16x32_F32E4M3E4M3_RS_TN
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
@@ -40786,11 +13297,12 @@ struct SM90_64x16x32_F32E5M2E4M3_RS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %13, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n16k32.f32.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7},"
      "{%8, %9, %10, %11},"
      " %12,"
@@ -40802,19 +13314,19 @@ struct SM90_64x16x32_F32E5M2E4M3_RS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x32x32 TN F16+=E5M2*E4M3
+// GMMA 64x32x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x32x32_F16E5M2E4M3_SS_TN
+struct MMA_64x32x32_F16E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -40829,11 +13341,12 @@ struct SM90_64x32x32_F16E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %10, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n32k32.f16.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7},"
      " %8,"
      " %9,"
@@ -40845,19 +13358,19 @@ struct SM90_64x32x32_F16E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x32x32 TN F16+=E5M2*E4M3
+// GMMA 64x32x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x32x32_F16E5M2E4M3_RS_TN
+struct MMA_64x32x32_F16E4M3E4M3_RS_TN
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
@@ -40872,11 +13385,12 @@ struct SM90_64x32x32_F16E5M2E4M3_RS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %13, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n32k32.f16.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7},"
      "{%8, %9, %10, %11},"
      " %12,"
@@ -40888,19 +13402,19 @@ struct SM90_64x32x32_F16E5M2E4M3_RS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x32x32 TN F32+=E5M2*E4M3
+// GMMA 64x32x32 TN F32+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x32x32_F32E5M2E4M3_SS_TN
+struct MMA_64x32x32_F32E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -40917,11 +13431,12 @@ struct SM90_64x32x32_F32E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %18, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.f32.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15},"
      " %16,"
@@ -40936,19 +13451,19 @@ struct SM90_64x32x32_F32E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x32x32 TN F32+=E5M2*E4M3
+// GMMA 64x32x32 TN F32+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x32x32_F32E5M2E4M3_RS_TN
+struct MMA_64x32x32_F32E4M3E4M3_RS_TN
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
@@ -40965,11 +13480,12 @@ struct SM90_64x32x32_F32E5M2E4M3_RS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %21, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n32k32.f32.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15},"
      "{%16, %17, %18, %19},"
@@ -40984,225 +13500,19 @@ struct SM90_64x32x32_F32E5M2E4M3_RS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN F16+=E5M2*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x48x32_F16E5M2E4M3_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[12];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %14, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.f16.e5m2.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11},"
-      " %12,"
-      " %13,"
-      " p, %15, %16;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN F16+=E5M2*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x48x32_F16E5M2E4M3_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[12];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %17, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.f16.e5m2.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11},"
-      "{%12, %13, %14, %15},"
-      " %16,"
-      " p, %18, %19;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN F32+=E5M2*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x48x32_F32E5M2E4M3_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %26, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.f32.e5m2.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      " %24,"
-      " %25,"
-      " p, %27, %28;\n"
-    "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x48x32 TN F32+=E5M2*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x48x32_F32E5M2E4M3_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[24];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %29, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n48k32.f32.e5m2.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23},"
-      "{%24, %25, %26, %27},"
-      " %28,"
-      " p, %30, %31;\n"
-    "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x64x32 TN F16+=E5M2*E4M3
+// GMMA 64x64x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x64x32_F16E5M2E4M3_SS_TN
+struct MMA_64x64x32_F16E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -41219,11 +13529,12 @@ struct SM90_64x64x32_F16E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %18, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n64k32.f16.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15},"
      " %16,"
@@ -41238,19 +13549,19 @@ struct SM90_64x64x32_F16E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x64x32 TN F16+=E5M2*E4M3
+// GMMA 64x64x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x64x32_F16E5M2E4M3_RS_TN
+struct MMA_64x64x32_F16E4M3E4M3_RS_TN
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
@@ -41267,11 +13578,12 @@ struct SM90_64x64x32_F16E5M2E4M3_RS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %21, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n64k32.f16.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15},"
      "{%16, %17, %18, %19},"
@@ -41286,19 +13598,19 @@ struct SM90_64x64x32_F16E5M2E4M3_RS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x64x32 TN F32+=E5M2*E4M3
+// GMMA 64x64x32 TN F32+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x64x32_F32E5M2E4M3_SS_TN
+struct MMA_64x64x32_F32E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -41319,11 +13631,12 @@ struct SM90_64x64x32_F32E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %34, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k32.f32.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -41344,19 +13657,19 @@ struct SM90_64x64x32_F32E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x64x32 TN F32+=E5M2*E4M3
+// GMMA 64x64x32 TN F32+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x64x32_F32E5M2E4M3_RS_TN
+struct MMA_64x64x32_F32E4M3E4M3_RS_TN
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
@@ -41377,11 +13690,12 @@ struct SM90_64x64x32_F32E5M2E4M3_RS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
     "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %37, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n64k32.f32.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -41402,255 +13716,19 @@ struct SM90_64x64x32_F32E5M2E4M3_RS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN F16+=E5M2*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x80x32_F16E5M2E4M3_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[20];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %22, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.f16.e5m2.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19},"
-      " %20,"
-      " %21,"
-      " p, %23, %24;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN F16+=E5M2*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x80x32_F16E5M2E4M3_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = uint32_t[20];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
-      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
-      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
-      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
-      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %25, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.f16.e5m2.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19},"
-      "{%20, %21, %22, %23},"
-      " %24,"
-      " p, %26, %27;\n"
-    "}\n"
-      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
-        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
-        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
-        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
-        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN F32+=E5M2*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x80x32_F32E5M2E4M3_SS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint64_t[1];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint64_t const& desc_a,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %42, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.f32.e5m2.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      " %40,"
-      " %41,"
-      " p, %43, %44;\n"
-    "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
-      : "l"(desc_a),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
-#endif
-  }
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-// GMMA 64x80x32 TN F32+=E5M2*E4M3
-template <
-  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
-  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
->
-struct SM90_64x80x32_F32E5M2E4M3_RS_TN
-{
-  using DRegisters = void;
-  using ARegisters = uint32_t[4];
-  using BRegisters = uint64_t[1];
-  using CRegisters = float[40];
-
-  CUTE_HOST_DEVICE static void
-  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
-      uint64_t const& desc_b,
-      float & d00, float & d01, float & d02, float & d03,
-      float & d04, float & d05, float & d06, float & d07,
-      float & d08, float & d09, float & d10, float & d11,
-      float & d12, float & d13, float & d14, float & d15,
-      float & d16, float & d17, float & d18, float & d19,
-      float & d20, float & d21, float & d22, float & d23,
-      float & d24, float & d25, float & d26, float & d27,
-      float & d28, float & d29, float & d30, float & d31,
-      float & d32, float & d33, float & d34, float & d35,
-      float & d36, float & d37, float & d38, float & d39,
-      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
-  {
-#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
-    asm volatile(
-    "{\n"
-      ".reg .pred p;\n"
-      "setp.ne.b32 p, %45, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n80k32.f32.e5m2.e4m3 "
-      "{%0, %1, %2, %3, %4, %5, %6, %7, "
-      " %8, %9, %10, %11, %12, %13, %14, %15, "
-      " %16, %17, %18, %19, %20, %21, %22, %23, "
-      " %24, %25, %26, %27, %28, %29, %30, %31, "
-      " %32, %33, %34, %35, %36, %37, %38, %39},"
-      "{%40, %41, %42, %43},"
-      " %44,"
-      " p, %46, %47;\n"
-    "}\n"
-      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
-        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
-        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
-        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
-        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
-        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
-        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
-        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
-        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
-        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
-      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
-        "l"(desc_b),
-        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
-#else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x96x32 TN F16+=E5M2*E4M3
+// GMMA 64x96x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x96x32_F16E5M2E4M3_SS_TN
+struct MMA_64x96x32_F16E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -41669,11 +13747,12 @@ struct SM90_64x96x32_F16E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
    "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %26, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n96k32.f16.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23},"
@@ -41691,19 +13770,19 @@ struct SM90_64x96x32_F16E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x96x32 TN F16+=E5M2*E4M3
+// GMMA 64x96x32 TN F16+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x96x32_F16E5M2E4M3_RS_TN
+struct MMA_64x96x32_F16E4M3E4M3_RS_TN
 {
   using DRegisters = void;
   using ARegisters = uint32_t[4];
@@ -41722,11 +13801,12 @@ struct SM90_64x96x32_F16E5M2E4M3_RS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
     asm volatile(
    "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %29, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.f16.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n96k32.f16.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23},"
@@ -41744,19 +13824,19 @@ struct SM90_64x96x32_F16E5M2E4M3_RS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
 #endif
   }
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-// GMMA 64x96x32 TN F32+=E5M2*E4M3
+// GMMA 64x96x32 TN F32+=E4M3*E4M3
 template <
   GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
   GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
 >
-struct SM90_64x96x32_F32E5M2E4M3_SS_TN
+struct MMA_64x96x32_F32E4M3E4M3_SS_TN
 {
   using DRegisters = void;
   using ARegisters = uint64_t[1];
@@ -41781,11 +13861,12 @@ struct SM90_64x96x32_F32E5M2E4M3_SS_TN
       GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
   {
 #if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
     asm volatile(
    "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %50, 0;\n"
-      "wgmma.mma_async.sync.aligned.m64n96k32.f32.e5m2.e4m3 "
+      "wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e4m3 "
      "{%0, %1, %2, %3, %4, %5, %6, %7, "
      " %8, %9, %10, %11, %12, %13, %14, %15, "
      " %16, %17, %18, %19, %20, %21, %22, %23, "
@@ -41812,19 +13893,19 @@ struct SM90_64x96x32_F32E5M2E4M3_SS_TN
        "l"(desc_b),
        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
 #else
-    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+    CUTE_INVALID_CONTROL_PATH("Attempting to use 
MMA_64x96x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F32+=E5M2*E4M3 +// GMMA 64x96x32 TN F32+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x96x32_F32E5M2E4M3_RS_TN +struct MMA_64x96x32_F32E4M3E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -41849,11 +13930,12 @@ struct SM90_64x96x32_F32E5M2E4M3_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f32.e5m2.e4m3 " + "wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -41880,25 +13962,24 @@ struct SM90_64x96x32_F32E5M2E4M3_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F16+=E5M2*E4M3 +// GMMA 64x128x32 TN F16+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x32_F16E5M2E4M3_SS_TN +struct MMA_64x128x32_F16E4M3E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -41910,21 +13991,23 @@ struct SM90_64x112x32_F16E5M2E4M3_SS_TN uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %30, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f16.e5m2.e4m3 " + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f16.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - " %28," - " %29," - " p, %31, %32;\n" + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p, %35, %36;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -41932,31 +14015,30 @@ struct SM90_64x112x32_F16E5M2E4M3_SS_TN "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F16+=E5M2*E4M3 +// GMMA 64x128x32 TN F16+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x32_F16E5M2E4M3_RS_TN +struct MMA_64x128x32_F16E4M3E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -41968,21 +14050,23 @@ struct SM90_64x112x32_F16E5M2E4M3_RS_TN uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %33, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f16.e5m2.e4m3 " + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f16.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - "{%28, %29, %30, %31}," - " %32," - " p, %34, %35;\n" + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p, %38, %39;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -41990,31 +14074,30 @@ struct SM90_64x112x32_F16E5M2E4M3_RS_TN "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F32+=E5M2*E4M3 +// GMMA 64x128x32 TN F32+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x32_F32E5M2E4M3_SS_TN +struct MMA_64x128x32_F32E4M3E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -42033,24 +14116,28 @@ struct SM90_64x112x32_F32E5M2E4M3_SS_TN float & d44, float & d45, float & d46, float & d47, float & d48, float & d49, float & d50, float & d51, float & 
d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p, %67, %68;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -42065,31 +14152,31 @@ struct SM90_64x112x32_F32E5M2E4M3_SS_TN "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F32+=E5M2*E4M3 +// GMMA 64x128x32 TN F32+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x32_F32E5M2E4M3_RS_TN +struct MMA_64x128x32_F32E4M3E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -42108,24 +14195,28 @@ struct SM90_64x112x32_F32E5M2E4M3_RS_TN float & d44, float & d45, float & d46, float & d47, float & d48, float & d49, float & d50, float & d51, float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " 
%60," - " p, %62, %63;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p, %70, %71;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -42140,30 +14231,31 @@ struct SM90_64x112x32_F32E5M2E4M3_RS_TN "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F16+=E5M2*E4M3 +// GMMA 64x192x32 TN F16+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x32_F16E5M2E4M3_SS_TN +struct MMA_64x192x32_F16E4M3E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -42176,21 +14268,28 @@ struct SM90_64x128x32_F16E5M2E4M3_SS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f16.e5m2.e4m3 " + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f16.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p, %35, %36;\n" + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p, %51, %52;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -42199,29 +14298,33 @@ struct SM90_64x128x32_F16E5M2E4M3_SS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F16+=E5M2*E4M3 +// GMMA 64x192x32 TN F16+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x32_F16E5M2E4M3_RS_TN +struct MMA_64x192x32_F16E4M3E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -42234,21 +14337,28 @@ struct SM90_64x128x32_F16E5M2E4M3_RS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p, %38, %39;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p, %54, %55;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -42257,29 +14367,33 @@ struct SM90_64x128x32_F16E5M2E4M3_RS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F32+=E5M2*E4M3 +// GMMA 64x192x32 TN F32+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One > -struct SM90_64x128x32_F32E5M2E4M3_SS_TN +struct MMA_64x192x32_F32E4M3E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[64]; + using CRegisters = float[96]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -42300,14 +14414,23 @@ struct SM90_64x128x32_F32E5M2E4M3_SS_TN float & d52, float & d53, float & d54, float & d55, float & d56, float & d57, float & d58, float & d59, float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -42315,10 +14438,14 @@ struct SM90_64x128x32_F32E5M2E4M3_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p, %67, %68;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p, %99, %100;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -42335,29 +14462,37 @@ struct SM90_64x128x32_F32E5M2E4M3_SS_TN "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F32+=E5M2*E4M3 +// GMMA 64x192x32 TN F32+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x32_F32E5M2E4M3_RS_TN +struct MMA_64x192x32_F32E4M3E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using 
BRegisters = uint64_t[1]; - using CRegisters = float[64]; + using CRegisters = float[96]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -42378,14 +14513,23 @@ struct SM90_64x128x32_F32E5M2E4M3_RS_TN float & d52, float & d53, float & d54, float & d55, float & d56, float & d57, float & d58, float & d59, float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -42393,10 +14537,14 @@ struct SM90_64x128x32_F32E5M2E4M3_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p, %70, %71;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p, %102, %103;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -42413,30 +14561,37 @@ struct SM90_64x128x32_F32E5M2E4M3_RS_TN "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F16+=E5M2*E4M3 +// GMMA 64x256x32 TN F16+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x32_F16E5M2E4M3_SS_TN +struct MMA_64x256x32_F16E4M3E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using 
BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -42450,22 +14605,33 @@ struct SM90_64x144x32_F16E5M2E4M3_SS_TN uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %38, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35}," - " %36," - " %37," - " p, %39, %40;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p, %67, %68;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -42475,31 +14641,36 @@ struct SM90_64x144x32_F16E5M2E4M3_SS_TN "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F16+=E5M2*E4M3 +// GMMA 64x256x32 TN F16+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x32_F16E5M2E4M3_RS_TN +struct MMA_64x256x32_F16E4M3E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void 
fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -42513,22 +14684,33 @@ struct SM90_64x144x32_F16E5M2E4M3_RS_TN uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %41, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35}," - "{%36, %37, %38, %39}," - " %40," - " p, %42, %43;\n" + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p, %70, %71;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -42538,61 +14720,81 @@ struct SM90_64x144x32_F16E5M2E4M3_RS_TN "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F32+=E5M2*E4M3 +// GMMA 64x256x32 TN F32+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x32_F32E5M2E4M3_SS_TN +struct MMA_64x256x32_F32E4M3E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[72]; + using CRegisters = float[128]; CUTE_HOST_DEVICE static void fma(uint64_t const& 
desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %74, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -42601,83 +14803,117 @@ struct SM90_64x144x32_F32E5M2E4M3_SS_TN " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " " %56, %57, %58, %59, %60, %61, 
%62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - " %72," - " %73," - " p, %75, %76;\n" + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p, %131, %132;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F32+=E5M2*E4M3 +// GMMA 64x256x32 TN 
F32+=E4M3*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x32_F32E5M2E4M3_RS_TN +struct MMA_64x256x32_F32E4M3E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[72]; + using CRegisters = float[128]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -42686,499 +14922,500 @@ struct SM90_64x144x32_F32E5M2E4M3_RS_TN " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p, %78, %79;\n" + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " p, %134, %135;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x8x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x8x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + 
uint32_t & d0, uint32_t & d1, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %4, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.f16.e4m3.e5m2 " + "{%0, %1}," + " %2," + " %3," + " p, %5, %6;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x8x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x8x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %7, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.f16.e4m3.e5m2 " + "{%0, %1}," + "{%2, %3, %4, %5}," + " %6," + " p, %8, %9;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x8x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x8x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p, %7, %8;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x8x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x8x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + 
fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p, %10, %11;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F16+=E5M2*E4M3 +// GMMA 64x16x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x32_F16E5M2E4M3_SS_TN +struct MMA_64x16x32_F16E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, 
%12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p, %43, %44;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p, %7, %8;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F16+=E5M2*E4M3 +// GMMA 64x16x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x32_F16E5M2E4M3_RS_TN +struct MMA_64x16x32_F16E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p, %46, %47;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p, %10, %11;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - 
"+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F32+=E5M2*E4M3 +// GMMA 64x16x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x32_F32E5M2E4M3_SS_TN +struct MMA_64x16x32_F32E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[80]; + using CRegisters = float[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %82, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f32.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - " %80," - " %81," - " p, %83, %84;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, 
%6, %7}," + " %8," + " %9," + " p, %11, %12;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F32+=E5M2*E4M3 +// GMMA 64x16x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x32_F32E5M2E4M3_RS_TN +struct MMA_64x16x32_F32E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[80]; + using CRegisters = float[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm 
volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f32.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p, %86, %87;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p, %14, %15;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F16+=E5M2*E4M3 +// GMMA 64x32x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x32_F16E5M2E4M3_SS_TN +struct MMA_64x32x32_F16E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - 
uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %46, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43}," - " %44," - " %45," - " p, %47, %48;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p, %11, %12;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F16+=E5M2*E4M3 +// GMMA 64x32x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x32_F16E5M2E4M3_RS_TN +struct MMA_64x32x32_F16E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const 
scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %49, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43}," - "{%44, %45, %46, %47}," - " %48," - " p, %50, %51;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p, %14, %15;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F32+=E5M2*E4M3 +// GMMA 64x32x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x32_F32E5M2E4M3_SS_TN +struct MMA_64x32x32_F32E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[88]; + using CRegisters = float[16]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -43187,93 +15424,47 @@ struct SM90_64x176x32_F32E5M2E4M3_SS_TN float & d04, float & d05, float & d06, float & d07, float & d08, float & d09, float & d10, float & d11, float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, 
float & d86, float & d87, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %90, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f32.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - " %88," - " %89," - " p, %91, %92;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p, %19, %20;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F32+=E5M2*E4M3 +// GMMA 64x32x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x32_F32E5M2E4M3_RS_TN +struct MMA_64x32x32_F32E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[88]; + using CRegisters = float[16]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -43282,92 +15473,47 @@ struct SM90_64x176x32_F32E5M2E4M3_RS_TN float & d04, float & d05, float & d06, float & d07, float & d08, float & d09, float & d10, float & d11, float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & 
d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f32.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p, %94, %95;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p, %22, %23;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F16+=E5M2*E4M3 +// GMMA 64x64x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB 
= GMMA::ScaleIn::One > -struct SM90_64x192x32_F16E5M2E4M3_SS_TN +struct MMA_64x64x32_F16E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -43376,66 +15522,47 @@ struct SM90_64x192x32_F16E5M2E4M3_SS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f16.e5m2.e4m3 " + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.f16.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p, %51, %52;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p, %19, %20;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F16+=E5M2*E4M3 +// GMMA 64x64x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x32_F16E5M2E4M3_RS_TN +struct MMA_64x64x32_F16E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -43444,66 +15571,47 @@ struct SM90_64x192x32_F16E5M2E4M3_RS_TN uint32_t & d04, uint32_t & d05, 
uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p, %54, %55;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p, %22, %23;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F32+=E5M2*E4M3 +// GMMA 64x64x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x32_F32E5M2E4M3_SS_TN +struct MMA_64x64x32_F32E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[96]; + using CRegisters = float[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -43516,45 +15624,22 @@ struct SM90_64x192x32_F32E5M2E4M3_SS_TN float & d20, float & d21, float & d22, float & d23, float & d24, float & d25, float & d26, float & d27, float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - 
float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - float & d88, float & d89, float & d90, float & d91, - float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f32.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " p, %99, %100;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p, %35, %36;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -43563,45 +15648,29 @@ struct SM90_64x192x32_F32E5M2E4M3_SS_TN "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), - "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), - "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F32+=E5M2*E4M3 +// GMMA 64x64x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x32_F32E5M2E4M3_RS_TN 
+struct MMA_64x64x32_F32E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[96]; + using CRegisters = float[32]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -43614,45 +15683,22 @@ struct SM90_64x192x32_F32E5M2E4M3_RS_TN float & d20, float & d21, float & d22, float & d23, float & d24, float & d25, float & d26, float & d27, float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - float & d88, float & d89, float & d90, float & d91, - float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f32.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p, %102, %103;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p, %38, %39;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -43661,46 +15707,29 @@ struct SM90_64x192x32_F32E5M2E4M3_RS_TN "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), 
"+f"(d85), "+f"(d86), "+f"(d87), - "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), - "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F16+=E5M2*E4M3 +// GMMA 64x96x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x32_F16E5M2E4M3_SS_TN +struct MMA_64x96x32_F16E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[52]; + using CRegisters = uint32_t[24]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -43711,69 +15740,50 @@ struct SM90_64x208x32_F16E5M2E4M3_SS_TN uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %54, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - " %52," - " %53," - " p, %55, %56;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to 
use SM90_64x208x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F16+=E5M2*E4M3 +// GMMA 64x96x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x32_F16E5M2E4M3_RS_TN +struct MMA_64x96x32_F16E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[52]; + using CRegisters = uint32_t[24]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -43784,279 +15794,188 @@ struct SM90_64x208x32_F16E5M2E4M3_RS_TN uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %57, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - "{%52, %53, %54, %55}," - " %56," - " p, %58, %59;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif 
//////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F32+=E5M2*E4M3 +// GMMA 64x96x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x32_F32E5M2E4M3_SS_TN +struct MMA_64x96x32_F32E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; + using CRegisters = float[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f32.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, 
%81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p, %107, %108;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p, %51, %52;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F32+=E5M2*E4M3 +// GMMA 64x96x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x32_F32E5M2E4M3_RS_TN +struct MMA_64x96x32_F32E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; + using CRegisters = float[48]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t 
const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f32.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p, %110, %111;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p, %54, %55;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), 
"+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F16+=E5M2*E4M3 +// GMMA 64x128x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x32_F16E5M2E4M3_SS_TN +struct MMA_64x128x32_F16E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -44069,30 +15988,22 @@ struct SM90_64x224x32_F16E5M2E4M3_SS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - 
"setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p, %35, %36;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -44101,37 +16012,29 @@ struct SM90_64x224x32_F16E5M2E4M3_SS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F16+=E5M2*E4M3 +// GMMA 64x128x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x32_F16E5M2E4M3_RS_TN +struct MMA_64x128x32_F16E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -44144,30 +16047,22 @@ struct SM90_64x224x32_F16E5M2E4M3_RS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, 
" - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p, %38, %39;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -44176,187 +16071,137 @@ struct SM90_64x224x32_F16E5M2E4M3_RS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F32+=E5M2*E4M3 +// GMMA 64x128x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x32_F32E5M2E4M3_SS_TN +struct MMA_64x128x32_F32E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[112]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, 
float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p, %115, %116;\n" + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p, %67, %68;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), 
"+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F32+=E5M2*E4M3 +// GMMA 64x128x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x32_F32E5M2E4M3_RS_TN +struct MMA_64x128x32_F32E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[112]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, 
float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f32.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -44364,69 +16209,49 @@ struct SM90_64x224x32_F32E5M2E4M3_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p, %118, %119;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p, %70, %71;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), 
"+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F16+=E5M2*E4M3 +// GMMA 64x192x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x32_F16E5M2E4M3_SS_TN +struct MMA_64x192x32_F16E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -44443,28 +16268,24 @@ struct SM90_64x240x32_F16E5M2E4M3_SS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %62, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f16.e5m2.e4m3 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - " %60," - " %61," - " p, %63, %64;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p, %51, %52;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -44477,34 +16298,29 @@ 
struct SM90_64x240x32_F16E5M2E4M3_SS_TN "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F16+=E5M2*E4M3 +// GMMA 64x192x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x32_F16E5M2E4M3_RS_TN +struct MMA_64x192x32_F16E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -44521,28 +16337,24 @@ struct SM90_64x240x32_F16E5M2E4M3_RS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %65, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f16.e5m2.e4m3 " + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f16.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - "{%60, %61, %62, %63}," - " %64," - " p, %66, %67;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p, %54, %55;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -44555,76 +16367,66 @@ struct SM90_64x240x32_F16E5M2E4M3_RS_TN "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F32+=E5M2*E4M3 +// GMMA 64x192x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x32_F32E5M2E4M3_SS_TN +struct MMA_64x192x32_F32E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; + using CRegisters = float[96]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & 
d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -44636,110 +16438,94 @@ struct SM90_64x240x32_F32E5M2E4M3_SS_TN " %64, %65, %66, %67, %68, %69, %70, %71, " " %72, %73, %74, %75, %76, %77, %78, %79, " " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p, %123, %124;\n" + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p, %99, %100;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + 
"+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F32+=E5M2*E4M3 +// GMMA 64x192x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x32_F32E5M2E4M3_RS_TN +struct MMA_64x192x32_F32E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; + using CRegisters = float[96]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float 
& d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f32.e5m2.e4m3 " + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f32.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -44751,62 +16537,52 @@ struct SM90_64x240x32_F32E5M2E4M3_RS_TN " %64, %65, %66, %67, %68, %69, %70, %71, " " %72, %73, %74, %75, %76, %77, %78, %79, " " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p, %126, %127;\n" + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p, %102, %103;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), 
"+f"(d117), "+f"(d118), "+f"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x32 TN F16+=E5M2*E4M3 +// GMMA 64x256x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x32_F16E5M2E4M3_SS_TN +struct MMA_64x256x32_F16E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -44835,11 +16611,12 @@ struct SM90_64x256x32_F16E5M2E4M3_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.f16.e5m2.e4m3 " + "wgmma.mma_async.sync.aligned.m64n256k32.f16.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -44872,19 +16649,19 @@ struct SM90_64x256x32_F16E5M2E4M3_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x32 TN F16+=E5M2*E4M3 +// GMMA 64x256x32 TN F16+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x32_F16E5M2E4M3_RS_TN +struct MMA_64x256x32_F16E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -44913,11 +16690,12 @@ struct SM90_64x256x32_F16E5M2E4M3_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.f16.e5m2.e4m3 " + "wgmma.mma_async.sync.aligned.m64n256k32.f16.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -44950,19 +16728,19 @@ struct SM90_64x256x32_F16E5M2E4M3_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x32 TN F32+=E5M2*E4M3 +// GMMA 64x256x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x32_F32E5M2E4M3_SS_TN +struct MMA_64x256x32_F32E4M3E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -45007,11 +16785,12 @@ struct SM90_64x256x32_F32E5M2E4M3_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %130, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.f32.e5m2.e4m3 " + "wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -45068,19 +16847,19 @@ struct SM90_64x256x32_F32E5M2E4M3_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x256x32 TN F32+=E5M2*E4M3 +// GMMA 64x256x32 TN F32+=E4M3*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x32_F32E5M2E4M3_RS_TN +struct MMA_64x256x32_F32E4M3E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -45125,11 +16904,12 @@ struct SM90_64x256x32_F32E5M2E4M3_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %133, 0;\n" - "wgmma.mma_async.sync.aligned.m64n256k32.f32.e5m2.e4m3 " + "wgmma.mma_async.sync.aligned.m64n256k32.f32.e4m3.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -45186,19 +16966,19 @@ struct SM90_64x256x32_F32E5M2E4M3_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; 
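////////////////////////////////////////////////////////////////////////////////////////////////////

// A minimal usage sketch, assuming the renamed atoms stay in namespace cute as in
// the rest of this file. The smallest FP8 shape defined just below,
// MMA_64x8x32_F16E5M2E4M3_SS_TN, consumes two shared-memory matrix descriptors
// (ARegisters/BRegisters = uint64_t[1]) and two packed-f16x2 accumulators
// (CRegisters = uint32_t[2]). The wrapper name example_wgmma_64x8 is hypothetical;
// descriptor construction and the warpgroup fence/commit/wait protocol around
// wgmma.mma_async are elided here, and in real kernels these atoms are driven
// through cute::MMA_Atom rather than called directly. The
// cutlass::arch::synclog_emit_wgmma_* calls added before each asm block record the
// call-site line and the descriptor operands for the synclog debugging facility.

__device__ void example_wgmma_64x8(uint64_t desc_a, uint64_t desc_b,
                                   uint32_t& acc0, uint32_t& acc1)
{
  using Atom = cute::MMA_64x8x32_F16E5M2E4M3_SS_TN<>;  // default GMMA::ScaleIn::One for A and B
  // ScaleOut::One accumulates into {acc0, acc1}; ScaleOut::Zero overwrites them.
  Atom::fma(desc_a, desc_b, acc0, acc1, cute::GMMA::ScaleOut::One);
}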
//////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x8x32 TN F16+=E5M2*E5M2 +// GMMA 64x8x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x8x32_F16E5M2E5M2_SS_TN +struct MMA_64x8x32_F16E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -45212,11 +16992,12 @@ struct SM90_64x8x32_F16E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %4, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f16.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n8k32.f16.e5m2.e4m3 " "{%0, %1}," " %2," " %3," @@ -45227,19 +17008,19 @@ struct SM90_64x8x32_F16E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x8x32 TN F16+=E5M2*E5M2 +// GMMA 64x8x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x8x32_F16E5M2E5M2_RS_TN +struct MMA_64x8x32_F16E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -45253,11 +17034,12 @@ struct SM90_64x8x32_F16E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %7, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f16.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n8k32.f16.e5m2.e4m3 " "{%0, %1}," "{%2, %3, %4, %5}," " %6," @@ -45268,19 +17050,19 @@ struct SM90_64x8x32_F16E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x8x32 TN F32+=E5M2*E5M2 +// GMMA 64x8x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x8x32_F32E5M2E5M2_SS_TN +struct MMA_64x8x32_F32E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -45294,11 +17076,12 @@ struct SM90_64x8x32_F32E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f32.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n8k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3}," " %4," " %5," @@ -45309,19 +17092,19 @@ struct SM90_64x8x32_F32E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90_64x8x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x8x32 TN F32+=E5M2*E5M2 +// GMMA 64x8x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x8x32_F32E5M2E5M2_RS_TN +struct MMA_64x8x32_F32E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -45335,11 +17118,12 @@ struct SM90_64x8x32_F32E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n8k32.f32.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n8k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3}," "{%4, %5, %6, %7}," " %8," @@ -45350,19 +17134,19 @@ struct SM90_64x8x32_F32E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x8x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x32 TN F16+=E5M2*E5M2 +// GMMA 64x16x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x16x32_F16E5M2E5M2_SS_TN +struct MMA_64x16x32_F16E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -45376,11 +17160,12 @@ struct SM90_64x16x32_F16E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %6, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.f16.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n16k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3}," " %4," " %5," @@ -45391,19 +17176,19 @@ struct SM90_64x16x32_F16E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x32 TN F16+=E5M2*E5M2 +// GMMA 64x16x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x16x32_F16E5M2E5M2_RS_TN +struct MMA_64x16x32_F16E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -45417,11 +17202,12 @@ struct SM90_64x16x32_F16E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %9, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.f16.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n16k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3}," "{%4, %5, %6, %7}," 
" %8," @@ -45432,19 +17218,19 @@ struct SM90_64x16x32_F16E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x32 TN F32+=E5M2*E5M2 +// GMMA 64x16x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x16x32_F32E5M2E5M2_SS_TN +struct MMA_64x16x32_F32E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -45459,11 +17245,12 @@ struct SM90_64x16x32_F32E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.f32.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n16k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7}," " %8," " %9," @@ -45475,19 +17262,19 @@ struct SM90_64x16x32_F32E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x16x32 TN F32+=E5M2*E5M2 +// GMMA 64x16x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x16x32_F32E5M2E5M2_RS_TN +struct MMA_64x16x32_F32E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -45502,11 +17289,12 @@ struct SM90_64x16x32_F32E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n16k32.f32.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n16k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7}," "{%8, %9, %10, %11}," " %12," @@ -45518,19 +17306,19 @@ struct SM90_64x16x32_F32E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x16x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x32 TN F16+=E5M2*E5M2 +// GMMA 64x32x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x32x32_F16E5M2E5M2_SS_TN +struct MMA_64x32x32_F16E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -45545,11 +17333,12 @@ struct SM90_64x32x32_F16E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %10, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f16.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n32k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7}," " %8," " %9," @@ -45561,19 +17350,19 @@ struct SM90_64x32x32_F16E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x32 TN F16+=E5M2*E5M2 +// GMMA 64x32x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x32x32_F16E5M2E5M2_RS_TN +struct MMA_64x32x32_F16E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -45588,11 +17377,12 @@ struct SM90_64x32x32_F16E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %13, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f16.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n32k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7}," "{%8, %9, %10, %11}," " %12," @@ -45604,19 +17394,19 @@ struct SM90_64x32x32_F16E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x32 TN F32+=E5M2*E5M2 +// GMMA 64x32x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x32x32_F32E5M2E5M2_SS_TN +struct MMA_64x32x32_F32E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -45633,11 +17423,12 @@ struct SM90_64x32x32_F32E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f32.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n32k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15}," " %16," @@ -45652,19 +17443,19 @@ struct SM90_64x32x32_F32E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x32x32 TN F32+=E5M2*E5M2 +// GMMA 64x32x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One > -struct SM90_64x32x32_F32E5M2E5M2_RS_TN +struct MMA_64x32x32_F32E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -45681,11 +17472,12 @@ struct SM90_64x32x32_F32E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n32k32.f32.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n32k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15}," "{%16, %17, %18, %19}," @@ -45700,225 +17492,19 @@ struct SM90_64x32x32_F32E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x32x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F16+=E5M2*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F16E5M2E5M2_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[12]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %14, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11}," - " %12," - " %13," - " p, %15, %16;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F16+=E5M2*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F16E5M2E5M2_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[12]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %17, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11}," - 
"{%12, %13, %14, %15}," - " %16," - " p, %18, %19;\n" - "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F32+=E5M2*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F32E5M2E5M2_SS_TN -{ - using DRegisters = void; - using ARegisters = uint64_t[1]; - using BRegisters = uint64_t[1]; - using CRegisters = float[24]; - - CUTE_HOST_DEVICE static void - fma(uint64_t const& desc_a, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p, %27, %28;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) - : "l"(desc_a), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); -#endif - } -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x48x32 TN F32+=E5M2*E5M2 -template < - GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, - GMMA::ScaleIn scaleB = GMMA::ScaleIn::One -> -struct SM90_64x48x32_F32E5M2E5M2_RS_TN -{ - using DRegisters = void; - using ARegisters = uint32_t[4]; - using BRegisters = uint64_t[1]; - using CRegisters = float[24]; - - CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, - uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) - { -#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) - asm volatile( - "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n48k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " 
%16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p, %30, %31;\n" - "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), - "l"(desc_b), - "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); -#else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x48x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F16+=E5M2*E5M2 +// GMMA 64x64x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x64x32_F16E5M2E5M2_SS_TN +struct MMA_64x64x32_F16E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -45935,11 +17521,12 @@ struct SM90_64x64x32_F16E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %18, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f16.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n64k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15}," " %16," @@ -45954,19 +17541,19 @@ struct SM90_64x64x32_F16E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F16+=E5M2*E5M2 +// GMMA 64x64x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x64x32_F16E5M2E5M2_RS_TN +struct MMA_64x64x32_F16E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -45983,11 +17570,12 @@ struct SM90_64x64x32_F16E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %21, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f16.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n64k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15}," "{%16, %17, %18, %19}," @@ -46002,19 +17590,19 @@ struct SM90_64x64x32_F16E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F32+=E5M2*E5M2 +// GMMA 64x64x32 TN 
F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x64x32_F32E5M2E5M2_SS_TN +struct MMA_64x64x32_F32E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -46035,11 +17623,12 @@ struct SM90_64x64x32_F32E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f32.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n64k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -46060,19 +17649,19 @@ struct SM90_64x64x32_F32E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x64x32 TN F32+=E5M2*E5M2 +// GMMA 64x64x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x64x32_F32E5M2E5M2_RS_TN +struct MMA_64x64x32_F32E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -46093,11 +17682,12 @@ struct SM90_64x64x32_F32E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n64k32.f32.e5m2.e5m2 " + "wgmma.mma_async.sync.aligned.m64n64k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -46118,25 +17708,24 @@ struct SM90_64x64x32_F32E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x64x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F16+=E5M2*E5M2 +// GMMA 64x96x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x80x32_F16E5M2E5M2_SS_TN +struct MMA_64x96x32_F16E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[20]; + using CRegisters = uint32_t[24]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -46146,50 +17735,51 @@ struct SM90_64x80x32_F16E5M2E5M2_SS_TN uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %22, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f16.e5m2.e5m2 " + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19}," - " %20," - " %21," - " p, %23, %24;\n" + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F16+=E5M2*E5M2 +// GMMA 64x96x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x80x32_F16E5M2E5M2_RS_TN +struct MMA_64x96x32_F16E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[20]; + using CRegisters = uint32_t[24]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -46199,50 +17789,51 @@ struct SM90_64x80x32_F16E5M2E5M2_RS_TN uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %25, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f16.e5m2.e5m2 " + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19}," - "{%20, %21, %22, %23}," - " %24," - " p, %26, %27;\n" + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); 
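// Sizing rule for the CRegisters arrays (derived from the 128-thread
// warpgroup owning the full 64xN accumulator tile):
//   f32 accumulators: 64*N/128 = N/2 floats per thread
//     (N=96 -> float[48], N=128 -> float[64], N=256 -> float[128]);
//   f16 accumulators: N/2 halves per thread, packed two per register,
//     i.e. N/4 uint32_t (N=96 -> uint32_t[24], N=128 -> uint32_t[32]).
// Minimal usage sketch for the 64x16 shape earlier in this family
// (hypothetical values; assumes SM90a compilation, an open cute namespace,
// and shared-memory descriptors built by CuTe's descriptor utilities):
//   uint64_t desc_a = /* A tile descriptor */ 0, desc_b = /* B tile descriptor */ 0;
//   uint32_t d0 = 0, d1 = 0, d2 = 0, d3 = 0;  // N=16, f16 accum: N/4 = 4 regs
//   MMA_64x16x32_F16E5M2E4M3_SS_TN<>::fma(desc_a, desc_b, d0, d1, d2, d3,
//                                         GMMA::ScaleOut::Zero); // D = A*B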
#endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F32+=E5M2*E5M2 +// GMMA 64x96x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x80x32_F32E5M2E5M2_SS_TN +struct MMA_64x96x32_F32E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[40]; + using CRegisters = float[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -46257,22 +17848,26 @@ struct SM90_64x80x32_F32E5M2E5M2_SS_TN float & d28, float & d29, float & d30, float & d31, float & d32, float & d33, float & d34, float & d35, float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p, %43, %44;\n" + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p, %51, %52;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -46283,31 +17878,31 @@ struct SM90_64x80x32_F32E5M2E5M2_SS_TN "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x80x32 TN F32+=E5M2*E5M2 +// GMMA 64x96x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x80x32_F32E5M2E5M2_RS_TN +struct MMA_64x96x32_F32E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[40]; + using CRegisters = float[48]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -46322,22 +17917,26 @@ struct SM90_64x80x32_F32E5M2E5M2_RS_TN float & d28, float & d29, float & d30, float & d31, float & d32, float & d33, float & d34, float & d35, float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float 
& d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n80k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p, %46, %47;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p, %54, %55;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -46348,30 +17947,31 @@ struct SM90_64x80x32_F32E5M2E5M2_RS_TN "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x80x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F16+=E5M2*E5M2 +// GMMA 64x128x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x96x32_F16E5M2E5M2_SS_TN +struct MMA_64x128x32_F16E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -46382,49 +17982,55 @@ struct SM90_64x96x32_F16E5M2E5M2_SS_TN uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %26, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f16.e5m2.e5m2 " + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - " %24," - " %25," - " p, %27, %28;\n" + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p, %35, %36;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), 
"+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F16+=E5M2*E5M2 +// GMMA 64x128x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x96x32_F16E5M2E5M2_RS_TN +struct MMA_64x128x32_F16E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[24]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -46435,49 +18041,55 @@ struct SM90_64x96x32_F16E5M2E5M2_RS_TN uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %29, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f16.e5m2.e5m2 " + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23}," - "{%24, %25, %26, %27}," - " %28," - " p, %30, %31;\n" + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p, %38, %39;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F32+=E5M2*E5M2 +// GMMA 64x128x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x96x32_F32E5M2E5M2_SS_TN +struct MMA_64x128x32_F32E5M2E4M3_SS_TN { using 
DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[48]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -46494,23 +18106,30 @@ struct SM90_64x96x32_F32E5M2E5M2_SS_TN float & d36, float & d37, float & d38, float & d39, float & d40, float & d41, float & d42, float & d43, float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p, %51, %52;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p, %67, %68;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -46523,29 +18142,33 @@ struct SM90_64x96x32_F32E5M2E5M2_SS_TN "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x96x32 TN F32+=E5M2*E5M2 +// GMMA 64x128x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x96x32_F32E5M2E5M2_RS_TN +struct MMA_64x128x32_F32E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[48]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -46562,23 +18185,30 @@ struct SM90_64x96x32_F32E5M2E5M2_RS_TN float & d36, float & d37, float & d38, float & d39, float & d40, float & d41, float & d42, float & d43, float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & 
d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n96k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p, %54, %55;\n" + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p, %70, %71;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -46591,30 +18221,33 @@ struct SM90_64x96x32_F32E5M2E5M2_RS_TN "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x96x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F16+=E5M2*E5M2 +// GMMA 64x192x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x32_F16E5M2E5M2_SS_TN +struct MMA_64x192x32_F16E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -46626,21 +18259,29 @@ struct SM90_64x112x32_F16E5M2E5M2_SS_TN uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %30, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f16.e5m2.e5m2 " + "setp.ne.b32 p, %50, 0;\n" + 
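// The setp/predicate idiom converts the runtime scale_D argument into the
// wgmma scale-d predicate operand: p is true when scale_D != 0
// (GMMA::ScaleOut::One), so the instruction accumulates D += A*B; when
// scale_D == 0 it computes D = A*B, overwriting the previous accumulator
// contents. scaleA and scaleB are compile-time immediates ("n" constraints)
// that negate the corresponding operand when set to -1 (GMMA::ScaleIn::Neg).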
"wgmma.mma_async.sync.aligned.m64n192k32.f16.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - " %28," - " %29," - " p, %31, %32;\n" + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p, %51, %52;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -46648,31 +18289,34 @@ struct SM90_64x112x32_F16E5M2E5M2_SS_TN "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F16+=E5M2*E5M2 +// GMMA 64x192x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x32_F16E5M2E5M2_RS_TN +struct MMA_64x192x32_F16E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[28]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -46684,21 +18328,29 @@ struct SM90_64x112x32_F16E5M2E5M2_RS_TN uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %33, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27}," - "{%28, %29, %30, %31}," - " %32," - " p, %34, %35;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + 
" p, %54, %55;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -46706,31 +18358,34 @@ struct SM90_64x112x32_F16E5M2E5M2_RS_TN "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F32+=E5M2*E5M2 +// GMMA 64x192x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x32_F32E5M2E5M2_SS_TN +struct MMA_64x192x32_F32E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = float[96]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -46749,24 +18404,40 @@ struct SM90_64x112x32_F32E5M2E5M2_SS_TN float & d44, float & d45, float & d46, float & d47, float & d48, float & d49, float & d50, float & d51, float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p, %99, %100;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -46781,31 
+18452,39 @@ struct SM90_64x112x32_F32E5M2E5M2_SS_TN "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x112x32 TN F32+=E5M2*E5M2 +// GMMA 64x192x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x112x32_F32E5M2E5M2_RS_TN +struct MMA_64x192x32_F32E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[56]; + using CRegisters = float[96]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -46824,24 +18503,40 @@ struct SM90_64x112x32_F32E5M2E5M2_RS_TN float & d44, float & d45, float & d46, float & d47, float & d48, float & d49, float & d50, float & d51, float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n112k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63;\n" + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, 
%98, %99}," + " %100," + " p, %102, %103;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -46856,30 +18551,39 @@ struct SM90_64x112x32_F32E5M2E5M2_RS_TN "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x112x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F16+=E5M2*E5M2 +// GMMA 64x256x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x32_F16E5M2E5M2_SS_TN +struct MMA_64x256x32_F16E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -46892,21 +18596,34 @@ struct SM90_64x128x32_F16E5M2E5M2_SS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %34, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - " %32," - " %33," - " p, %35, %36;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, 
%58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p, %67, %68;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -46915,29 +18632,37 @@ struct SM90_64x128x32_F16E5M2E5M2_SS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F16+=E5M2*E5M2 +// GMMA 64x256x32 TN F16+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x32_F16E5M2E5M2_RS_TN +struct MMA_64x256x32_F16E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[32]; + using CRegisters = uint32_t[64]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -46950,21 +18675,34 @@ struct SM90_64x128x32_F16E5M2E5M2_RS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %37, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31}," - "{%32, %33, %34, %35}," - " %36," - " p, %38, %39;\n" + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," 
+ "{%64, %65, %66, %67}," + " %68," + " p, %70, %71;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -46973,57 +18711,82 @@ struct SM90_64x128x32_F16E5M2E5M2_RS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F32+=E5M2*E5M2 +// GMMA 64x256x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x32_F32E5M2E5M2_SS_TN +struct MMA_64x256x32_F32E5M2E4M3_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[64]; + using CRegisters = float[128]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, 
float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %66, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %130, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -47031,77 +18794,118 @@ struct SM90_64x128x32_F32E5M2E5M2_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - " %64," - " %65," - " p, %67, %68;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " p, %131, %132;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), 
"+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x128x32 TN F32+=E5M2*E5M2 +// GMMA 64x256x32 TN F32+=E5M2*E4M3 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x128x32_F32E5M2E5M2_RS_TN +struct MMA_64x256x32_F32E5M2E4M3_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[64]; + using CRegisters = float[128]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + 
float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %69, 0;\n" - "wgmma.mma_async.sync.aligned.m64n128k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %133, 0;\n" + "wgmma.mma_async.sync.aligned.m64n256k32.f32.e5m2.e4m3 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -47109,792 +18913,501 @@ struct SM90_64x128x32_F32E5M2E5M2_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63}," - "{%64, %65, %66, %67}," - " %68," - " p, %70, %71;\n" + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " p, %134, %135;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), 
"+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x128x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F16+=E5M2*E5M2 +// GMMA 64x8x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x32_F16E5M2E5M2_SS_TN +struct MMA_64x8x32_F16E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[2]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d0, uint32_t & d1, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %38, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, 
%35}," - " %36," - " %37," - " p, %39, %40;\n" + "setp.ne.b32 p, %4, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.f16.e5m2.e5m2 " + "{%0, %1}," + " %2," + " %3," + " p, %5, %6;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "+r"(d0), "+r"(d1) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F16+=E5M2*E5M2 +// GMMA 64x8x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x32_F16E5M2E5M2_RS_TN +struct MMA_64x8x32_F16E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[36]; + using CRegisters = uint32_t[2]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d0, uint32_t & d1, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %41, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35}," - "{%36, %37, %38, %39}," - " %40," - " p, %42, %43;\n" + "setp.ne.b32 p, %7, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.f16.e5m2.e5m2 " + "{%0, %1}," + "{%2, %3, %4, %5}," + " %6," + " p, %8, %9;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), 
+ : "+r"(d0), "+r"(d1) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F32+=E5M2*E5M2 +// GMMA 64x8x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x32_F32E5M2E5M2_SS_TN +struct MMA_64x8x32_F32E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[72]; + using CRegisters = float[4]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, + float & d0, float & d1, float & d2, float & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" - ".reg .pred p;\n" - "setp.ne.b32 p, %74, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - " %72," - " %73," - " p, %75, %76;\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p, %7, %8;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), 
"+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x144x32 TN F32+=E5M2*E5M2 +// GMMA 64x8x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x144x32_F32E5M2E5M2_RS_TN +struct MMA_64x8x32_F32E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[72]; + using CRegisters = float[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, + float & d0, float & d1, float & d2, float & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %77, 0;\n" - "wgmma.mma_async.sync.aligned.m64n144k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71}," - "{%72, %73, %74, %75}," - " %76," - " p, %78, %79;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n8k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p, %10, %11;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - 
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x144x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x8x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F16+=E5M2*E5M2 +// GMMA 64x16x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x32_F16E5M2E5M2_SS_TN +struct MMA_64x16x32_F16E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %42, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - " %40," - " %41," - " p, %43, %44;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " p, %7, %8;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - 
"+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F16+=E5M2*E5M2 +// GMMA 64x16x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x32_F16E5M2E5M2_RS_TN +struct MMA_64x16x32_F16E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[40]; + using CRegisters = uint32_t[4]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %45, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39}," - "{%40, %41, %42, %43}," - " %44," - " p, %46, %47;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " p, %10, %11;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F32+=E5M2*E5M2 +// GMMA 64x16x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x32_F32E5M2E5M2_SS_TN +struct MMA_64x16x32_F32E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[80]; + using CRegisters = float[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %82, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - " %80," - " %81," - " p, %83, %84;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n16k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p, %11, %12;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - 
"+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x160x32 TN F32+=E5M2*E5M2 +// GMMA 64x16x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x160x32_F32E5M2E5M2_RS_TN +struct MMA_64x16x32_F32E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[80]; + using CRegisters = float[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - float & d00, float & d01, float & d02, float & d03, - float & d04, float & d05, float & d06, float & d07, - float & d08, float & d09, float & d10, float & d11, - float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %85, 0;\n" - "wgmma.mma_async.sync.aligned.m64n160k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79}," - "{%80, %81, %82, %83}," - " %84," - " p, %86, %87;\n" + "setp.ne.b32 p, %13, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n16k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p, %14, %15;\n" "}\n" - : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), - "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), - "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x160x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x16x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F16+=E5M2*E5M2 +// GMMA 64x32x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x32_F16E5M2E5M2_SS_TN +struct MMA_64x32x32_F16E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %46, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " 
%32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43}," - " %44," - " %45," - " p, %47, %48;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " p, %11, %12;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F16+=E5M2*E5M2 +// GMMA 64x32x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x32_F16E5M2E5M2_RS_TN +struct MMA_64x32x32_F16E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[44]; + using CRegisters = uint32_t[8]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, uint64_t const& desc_b, - uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, - uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, - uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, - uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %49, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43}," - "{%44, %45, %46, %47}," - " %48," - " p, %50, %51;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.f16.e5m2.e5m2 " + "{%0, 
%1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " p, %14, %15;\n" "}\n" - : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), - "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), - "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) - : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F32+=E5M2*E5M2 +// GMMA 64x32x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x32_F32E5M2E5M2_SS_TN +struct MMA_64x32x32_F32E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[88]; + using CRegisters = float[16]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -47903,93 +19416,47 @@ struct SM90_64x176x32_F32E5M2E5M2_SS_TN float & d04, float & d05, float & d06, float & d07, float & d08, float & d09, float & d10, float & d11, float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %90, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, 
%61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - " %88," - " %89," - " p, %91, %92;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p, %19, %20;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x176x32 TN F32+=E5M2*E5M2 +// GMMA 64x32x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x176x32_F32E5M2E5M2_RS_TN +struct MMA_64x32x32_F32E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[88]; + using CRegisters = float[16]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -47998,92 +19465,47 @@ struct SM90_64x176x32_F32E5M2E5M2_RS_TN float & d04, float & d05, float & d06, float & d07, float & d08, float & d09, float & d10, float & d11, float & d12, float & d13, float & d14, float & d15, - float & d16, float & d17, float & d18, float & d19, - float & d20, float & d21, float & d22, float & d23, - float & d24, float & d25, float & d26, float & d27, - float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & 
d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %93, 0;\n" - "wgmma.mma_async.sync.aligned.m64n176k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87}," - "{%88, %89, %90, %91}," - " %92," - " p, %94, %95;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n32k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p, %22, %23;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), - "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), - "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), - "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), - "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x176x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x32x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F16+=E5M2*E5M2 +// GMMA 64x64x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x32_F16E5M2E5M2_SS_TN +struct MMA_64x64x32_F16E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -48092,66 +19514,47 @@ struct SM90_64x192x32_F16E5M2E5M2_SS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & 
d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %50, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f16.e5m2.e5m2 " + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.f16.e5m2.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - " %48," - " %49," - " p, %51, %52;\n" + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " p, %19, %20;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F16+=E5M2*E5M2 +// GMMA 64x64x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x32_F16E5M2E5M2_RS_TN +struct MMA_64x64x32_F16E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[48]; + using CRegisters = uint32_t[16]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -48160,66 +19563,47 @@ struct SM90_64x192x32_F16E5M2E5M2_RS_TN uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, - uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, - uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, 
uint32_t & d46, uint32_t & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %53, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47}," - "{%48, %49, %50, %51}," - " %52," - " p, %54, %55;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " p, %22, %23;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), - "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F32+=E5M2*E5M2 +// GMMA 64x64x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x32_F32E5M2E5M2_SS_TN +struct MMA_64x64x32_F32E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[96]; + using CRegisters = float[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -48232,45 +19616,22 @@ struct SM90_64x192x32_F32E5M2E5M2_SS_TN float & d20, float & d21, float & d22, float & d23, float & d24, float & d25, float & d26, float & d27, float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - float & d88, float & d89, float & d90, float & d91, - float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %98, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - " %96," - " %97," - " p, %99, %100;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p, %35, %36;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -48279,45 +19640,29 @@ struct SM90_64x192x32_F32E5M2E5M2_SS_TN "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), - "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), - "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -// GMMA 64x192x32 TN F32+=E5M2*E5M2 +// GMMA 64x64x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x192x32_F32E5M2E5M2_RS_TN +struct MMA_64x64x32_F32E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[96]; + using CRegisters = float[32]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -48330,45 +19675,22 @@ struct SM90_64x192x32_F32E5M2E5M2_RS_TN float & d20, float & d21, float & d22, float & d23, float & d24, float & d25, float & d26, float & d27, float & d28, float & d29, float & d30, float & d31, - float & d32, float & d33, float & d34, float & d35, - float & d36, float & d37, float & d38, float & d39, - float & 
d40, float & d41, float & d42, float & d43, - float & d44, float & d45, float & d46, float & d47, - float & d48, float & d49, float & d50, float & d51, - float & d52, float & d53, float & d54, float & d55, - float & d56, float & d57, float & d58, float & d59, - float & d60, float & d61, float & d62, float & d63, - float & d64, float & d65, float & d66, float & d67, - float & d68, float & d69, float & d70, float & d71, - float & d72, float & d73, float & d74, float & d75, - float & d76, float & d77, float & d78, float & d79, - float & d80, float & d81, float & d82, float & d83, - float & d84, float & d85, float & d86, float & d87, - float & d88, float & d89, float & d90, float & d91, - float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %101, 0;\n" - "wgmma.mma_async.sync.aligned.m64n192k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95}," - "{%96, %97, %98, %99}," - " %100," - " p, %102, %103;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n64k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p, %38, %39;\n" "}\n" : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), @@ -48377,46 +19699,29 @@ struct SM90_64x192x32_F32E5M2E5M2_RS_TN "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), - "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), - "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), - "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), - "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), - "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), - "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), - "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), - "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), - "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), - "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), - "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), - "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), - "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), - "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), - "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), - "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), - "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x192x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x64x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F16+=E5M2*E5M2 +// GMMA 64x96x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x32_F16E5M2E5M2_SS_TN +struct MMA_64x96x32_F16E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[52]; + using CRegisters = uint32_t[24]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -48427,69 +19732,50 @@ struct SM90_64x208x32_F16E5M2E5M2_SS_TN uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %54, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - " %52," - " %53," - " p, %55, %56;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F16+=E5M2*E5M2 +// GMMA 64x96x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x32_F16E5M2E5M2_RS_TN +struct MMA_64x96x32_F16E5M2E5M2_RS_TN { using DRegisters = void; using 
ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[52]; + using CRegisters = uint32_t[24]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -48500,279 +19786,188 @@ struct SM90_64x208x32_F16E5M2E5M2_RS_TN uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, - uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, - uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %57, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51}," - "{%52, %53, %54, %55}," - " %56," - " p, %58, %59;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F32+=E5M2*E5M2 +// GMMA 64x96x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x32_F32E5M2E5M2_SS_TN +struct MMA_64x96x32_F32E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; + using CRegisters = float[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, 
float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %106, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f32.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - " %104," - " %105," - " p, %107, %108;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p, %51, %52;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), 
"+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x208x32 TN F32+=E5M2*E5M2 +// GMMA 64x96x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x208x32_F32E5M2E5M2_RS_TN +struct MMA_64x96x32_F32E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[104]; + using CRegisters = float[48]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, 
float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %109, 0;\n" - "wgmma.mma_async.sync.aligned.m64n208k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n96k32.f32.e5m2.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103}," - "{%104, %105, %106, %107}," - " %108," - " p, %110, %111;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p, %54, %55;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), 
"+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x208x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x96x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F16+=E5M2*E5M2 +// GMMA 64x128x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x32_F16E5M2E5M2_SS_TN +struct MMA_64x128x32_F16E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -48785,30 +19980,22 @@ struct SM90_64x224x32_F16E5M2E5M2_SS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %58, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - " %56," - " %57," - " p, %59, %60;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, 
%26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " p, %35, %36;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -48817,37 +20004,29 @@ struct SM90_64x224x32_F16E5M2E5M2_SS_TN "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F16+=E5M2*E5M2 +// GMMA 64x128x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x32_F16E5M2E5M2_RS_TN +struct MMA_64x128x32_F16E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[56]; + using CRegisters = uint32_t[32]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -48860,109 +20039,82 @@ struct SM90_64x224x32_F16E5M2E5M2_RS_TN uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, - uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, - uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, - uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, - uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %61, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55}," - "{%56, %57, %58, %59}," - " %60," - " p, %62, %63;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " p, %38, %39;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), "+r"(d12), 
"+r"(d13), "+r"(d14), "+r"(d15), - "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), - "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), - "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), - "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), - "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), - "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), - "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F32+=E5M2*E5M2 +// GMMA 64x128x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x32_F32E5M2E5M2_SS_TN +struct MMA_64x128x32_F32E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[112]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & 
d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %114, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f32.e5m2.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -48970,109 +20122,78 @@ struct SM90_64x224x32_F32E5M2E5M2_SS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - " %112," - " %113," - " p, %115, %116;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " p, %67, %68;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), 
"+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x224x32 TN F32+=E5M2*E5M2 +// GMMA 64x128x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x224x32_F32E5M2E5M2_RS_TN +struct MMA_64x128x32_F32E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[112]; + using CRegisters = float[64]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & 
d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %117, 0;\n" - "wgmma.mma_async.sync.aligned.m64n224k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sync.aligned.m64n128k32.f32.e5m2.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -49080,69 +20201,49 @@ struct SM90_64x224x32_F32E5M2E5M2_RS_TN " %32, %33, %34, %35, %36, %37, %38, %39, " " %40, %41, %42, %43, %44, %45, %46, %47, " " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59, %60, %61, %62, %63, " - " %64, %65, %66, %67, %68, %69, %70, %71, " - " %72, %73, %74, %75, %76, %77, %78, %79, " - " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111}," - "{%112, %113, %114, %115}," - " %116," - " p, %118, %119;\n" + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " p, %70, %71;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), 
+ "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x224x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x128x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F16+=E5M2*E5M2 +// GMMA 64x192x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x32_F16E5M2E5M2_SS_TN +struct MMA_64x192x32_F16E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, @@ -49159,28 +20260,24 @@ struct SM90_64x240x32_F16E5M2E5M2_SS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %62, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f16.e5m2.e5m2 " - "{%0, %1, %2, %3, %4, %5, %6, %7, " - " %8, %9, %10, %11, %12, %13, %14, %15, " - " %16, %17, %18, %19, %20, %21, %22, %23, " - " %24, %25, %26, %27, %28, %29, %30, %31, " - " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - " %60," - " %61," - " p, %63, %64;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " p, %51, %52;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -49193,34 +20290,29 @@ struct SM90_64x240x32_F16E5M2E5M2_SS_TN "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90_64x240x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F16+=E5M2*E5M2 +// GMMA 64x192x32 TN F16+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x32_F16E5M2E5M2_RS_TN +struct MMA_64x192x32_F16E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = uint32_t[60]; + using CRegisters = uint32_t[48]; CUTE_HOST_DEVICE static void fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, @@ -49237,28 +20329,24 @@ struct SM90_64x240x32_F16E5M2E5M2_RS_TN uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, - uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, - uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, - uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %65, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f16.e5m2.e5m2 " + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f16.e5m2.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " " %24, %25, %26, %27, %28, %29, %30, %31, " " %32, %33, %34, %35, %36, %37, %38, %39, " - " %40, %41, %42, %43, %44, %45, %46, %47, " - " %48, %49, %50, %51, %52, %53, %54, %55, " - " %56, %57, %58, %59}," - "{%60, %61, %62, %63}," - " %64," - " p, %66, %67;\n" + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " p, %54, %55;\n" "}\n" : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), @@ -49271,76 +20359,66 @@ struct SM90_64x240x32_F16E5M2E5M2_RS_TN "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), - "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), - "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), - "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), - "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F32+=E5M2*E5M2 +// GMMA 64x192x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x32_F32E5M2E5M2_SS_TN +struct MMA_64x192x32_F32E5M2E5M2_SS_TN { using DRegisters = void; 
using ARegisters = uint64_t[1]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; + using CRegisters = float[96]; CUTE_HOST_DEVICE static void fma(uint64_t const& desc_a, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %122, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f32.e5m2.e5m2 " + 
"setp.ne.b32 p, %98, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f32.e5m2.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -49352,110 +20430,94 @@ struct SM90_64x240x32_F32E5M2E5M2_SS_TN " %64, %65, %66, %67, %68, %69, %70, %71, " " %72, %73, %74, %75, %76, %77, %78, %79, " " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - " %120," - " %121," - " p, %123, %124;\n" + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " p, %99, %100;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) : "l"(desc_a), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F32E5M2E5M2_SS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -// GMMA 64x240x32 TN F32+=E5M2*E5M2 +// GMMA 64x192x32 TN F32+=E5M2*E5M2 template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x240x32_F32E5M2E5M2_RS_TN +struct MMA_64x192x32_F32E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; using BRegisters = uint64_t[1]; - using CRegisters = float[120]; + using CRegisters = float[96]; CUTE_HOST_DEVICE static void - fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, uint64_t const& desc_b, - float & d000, float & d001, float & d002, float & d003, - float & d004, float & d005, float & d006, float & d007, - float & d008, float & d009, float & d010, float & d011, - float & d012, float & d013, float & d014, float & d015, - float & d016, float & d017, float & d018, float & d019, - float & d020, float & d021, float & d022, float & d023, - float & d024, float & d025, float & d026, float & d027, - float & d028, float & d029, float & d030, float & d031, - float & d032, float & d033, float & d034, float & d035, - float & d036, float & d037, float & d038, float & d039, - float & d040, float & d041, float & d042, float & d043, - float & d044, float & d045, float & d046, float & d047, - float & d048, float & d049, float & d050, float & d051, - float & d052, float & d053, float & d054, float & d055, - float & d056, float & d057, float & d058, float & d059, - float & d060, float & d061, float & d062, float & d063, - float & d064, float & d065, float & d066, float & d067, - float & d068, float & d069, float & d070, float & d071, - float & d072, float & d073, float & d074, float & d075, - float & d076, float & d077, float & d078, float & d079, - float & d080, float & d081, float & d082, float & d083, - float & d084, float & d085, float & d086, float & d087, - float & d088, float & d089, float & d090, float & d091, - float & d092, float & d093, float & d094, float & d095, - float & d096, float & d097, float & d098, float & d099, - float & d100, float & d101, float & d102, float & d103, - float & d104, float & d105, float & d106, float & d107, - float & d108, float & d109, float & d110, float & d111, - float & d112, float & d113, float & d114, float & d115, - float & d116, float & d117, float & d118, float & d119, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & 
d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" - "setp.ne.b32 p, %125, 0;\n" - "wgmma.mma_async.sync.aligned.m64n240k32.f32.e5m2.e5m2 " + "setp.ne.b32 p, %101, 0;\n" + "wgmma.mma_async.sync.aligned.m64n192k32.f32.e5m2.e5m2 " "{%0, %1, %2, %3, %4, %5, %6, %7, " " %8, %9, %10, %11, %12, %13, %14, %15, " " %16, %17, %18, %19, %20, %21, %22, %23, " @@ -49467,53 +20529,43 @@ struct SM90_64x240x32_F32E5M2E5M2_RS_TN " %64, %65, %66, %67, %68, %69, %70, %71, " " %72, %73, %74, %75, %76, %77, %78, %79, " " %80, %81, %82, %83, %84, %85, %86, %87, " - " %88, %89, %90, %91, %92, %93, %94, %95, " - " %96, %97, %98, %99, %100, %101, %102, %103, " - " %104, %105, %106, %107, %108, %109, %110, %111, " - " %112, %113, %114, %115, %116, %117, %118, %119}," - "{%120, %121, %122, %123}," - " %124," - " p, %126, %127;\n" + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " p, %102, %103;\n" "}\n" - : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), - "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), - "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), - "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), - "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), - "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), - "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), - "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), - "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), - "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), - "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), - "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), - "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), - "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), - "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), - "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), - "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), - "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), - "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), - "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), - "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), - "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), - "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), - "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), - "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), - "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), - "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), - "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), - "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), - "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) - : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), 
"+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x240x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x192x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// @@ -49522,7 +20574,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x32_F16E5M2E5M2_SS_TN +struct MMA_64x256x32_F16E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -49551,6 +20603,7 @@ struct SM90_64x256x32_F16E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -49588,7 +20641,7 @@ struct SM90_64x256x32_F16E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -49600,7 +20653,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x32_F16E5M2E5M2_RS_TN +struct MMA_64x256x32_F16E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -49629,6 +20682,7 @@ struct SM90_64x256x32_F16E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -49666,7 +20720,7 @@ struct SM90_64x256x32_F16E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -49678,7 +20732,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x32_F32E5M2E5M2_SS_TN +struct MMA_64x256x32_F32E5M2E5M2_SS_TN { using DRegisters = void; using ARegisters = uint64_t[1]; @@ -49723,6 +20777,7 @@ struct SM90_64x256x32_F32E5M2E5M2_SS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -49784,7 +20839,7 @@ struct 
SM90_64x256x32_F32E5M2E5M2_SS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; @@ -49796,7 +20851,7 @@ template < GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, GMMA::ScaleIn scaleB = GMMA::ScaleIn::One > -struct SM90_64x256x32_F32E5M2E5M2_RS_TN +struct MMA_64x256x32_F32E5M2E5M2_RS_TN { using DRegisters = void; using ARegisters = uint32_t[4]; @@ -49841,6 +20896,7 @@ struct SM90_64x256x32_F32E5M2E5M2_RS_TN GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) { #if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); asm volatile( "{\n" ".reg .pred p;\n" @@ -49902,11 +20958,17 @@ struct SM90_64x256x32_F32E5M2E5M2_RS_TN "l"(desc_b), "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); #else - CUTE_INVALID_CONTROL_PATH("Attempting to use SM90_64x256x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x256x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); #endif } }; //////////////////////////////////////////////////////////////////////////////////////////////////// +} // namespace SM90::GMMA + } // namespace cute + +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +#include "mma_sm90_gmma_ext.hpp" +#endif diff --git a/include/cute/arch/mma_sm90_gmma_ext.hpp b/include/cute/arch/mma_sm90_gmma_ext.hpp new file mode 100644 index 0000000000..10a36aff80 --- /dev/null +++ b/include/cute/arch/mma_sm90_gmma_ext.hpp @@ -0,0 +1,56445 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ *
+ **************************************************************************************************/
+
+#pragma once
+
+#include <cute/config.hpp> // CUTE_HOST_DEVICE
+
+#include "cutlass/arch/synclog.hpp"
+
+// Config
+#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && defined(__CUDA_ARCH_FEAT_SM90_ALL))
+#  define CUTE_ARCH_MMA_SM90A_ENABLED
+#endif
+
+namespace cute {
+
+namespace SM90::GMMA {
+
+// GMMA 64x24x16 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x24x16_F16F16F16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[6];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
+      uint32_t & d4, uint32_t & d5,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %8, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n24k16.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5},"
+      " %6,"
+      " %7,"
+      " p, %9, %10, %11, %12;\n"
+    "}\n"
+      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
+        "+r"(d4), "+r"(d5)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x24x16 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x24x16_F16F16F16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[6];
+
+  static_assert(tnspA == GMMA::Major::K,
+    "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
+      uint64_t const& desc_b,
+      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
+      uint32_t & d4, uint32_t & d5,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %11, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n24k16.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5},"
+      "{%6, %7, %8, %9},"
+      " %10,"
+      " p, %12, %13, %14;\n"
+    "}\n"
+      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
+        "+r"(d4), "+r"(d5)
+      : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x40x16 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x40x16_F16F16F16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+ using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " p, %13, %14, %15, %16;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " p, %16, %17, %18;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + 
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16, %17, %18;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19, %20;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + 
".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " p, %17, %18, %19, %20;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " p, %20, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + 
"setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " p, %21, %22, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " p, %24, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + 
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27, %28;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, 
uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + " %22," + " %23," + " p, %25, %26, %27, %28;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " p, %28, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// 
GMMA 64x104x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " p, %29, %30, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + "{%26, %27, %28, %29}," + " %30," + 
" p, %32, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & 
d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35, %36;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + " %30," + " %31," + " p, %33, %34, %35, %36;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + "{%30, %31, %32, %33}," + " %34," + " p, %36, %37, %38;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, 
+ uint32_t & d32, uint32_t & d33, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + " %34," + " %35," + " p, %37, %38, %39, %40;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + "{%34, %35, %36, %37}," + " %38," + " p, %40, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), 
"n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & 
d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " p, %41, %42, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), 
"+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + "{%38, %39, %40, %41}," + " %42," + " p, %44, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; 
+ using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " p, %45, %46, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), 
"+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + "{%42, %43, %44, %45}," + " %46," + " p, %48, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE 
static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + 
uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51, %52;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + " %46," + " %47," + " p, %49, %50, %51, %52;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), 
"+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + "{%46, %47, %48, %49}," + " %50," + " p, %52, %53, %54;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + " %50," + " %51," + " p, %53, %54, %55, %56;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t 
const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + "{%50, %51, %52, %53}," + " %54," + " p, %56, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + 
uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n208k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + " %54," + " %55," + " p, %57, %58, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), 
"+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + "{%54, %55, %56, %57}," + " %58," + " p, %60, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), 
"+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x16 
F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = 
uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " p, %61, %62, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, 
uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " p, %64, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & 
d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, 
uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67, %68;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x16_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + " %62," + " %63," + " p, %65, %66, %67, %68;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x16_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x16 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x16_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" 
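+    // The runtime scale_D flag is turned into predicate p by the setp below; wgmma +    // consumes p as its scale-d operand (p==0: D = A*B, p==1: D = A*B + D).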
+ ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k16.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " p, %68, %69, %70;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x16_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16, %17, %18;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + static_assert(tnspA == GMMA::Major::K, + 
"Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19, %20;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24, %25, %26;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + 
CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27, %28;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28, %29, %30;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
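// ScaleIn::One / ScaleIn::Neg map to wgmma's immediate scale-a / scale-b operands, +  // optionally negating an input matrix as it is consumed. + 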
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31, %32;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32, %33, %34;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), 
"+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35, %36;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const 
scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40, %41, %42;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43, %44;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + 
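// Fallback when compiled without sm90a support: reaching this call at run time is +  // treated as a usage error rather than a silent no-op. + 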
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44, %45, %46;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & 
d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47, %48;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48, %49, %50;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + 
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51, %52;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x16_F32F16F16_SS +{ + using DRegisters = void; + using 
ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56, %57, %58;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float 
& d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59, %60;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, 
%28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60, %61, %62;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63, %64;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + 
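// The "+f" constraints mark every accumulator as read-modify-write, keeping the whole +      // D fragment resident in registers across the asynchronous MMA. +      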
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64, %65, %66;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67, %68;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
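// tnspA/tnspB select GMMA::Major::K or GMMA::Major::MN operand layouts, while
+  // scaleA/scaleB (GMMA::ScaleIn) can negate A/B; all four are passed to the
+  // wgmma instruction below as "n" immediates.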
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x136x16_F32F16F16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[68];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %70, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n136k16.f32.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67},"
+      " %68,"
+      " %69,"
+      " p, %71, %72, %73, %74;\n"
+    "}\n"
+    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+      "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+      "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+      "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67)
+    : "l"(desc_a),
+      "l"(desc_b),
+      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x136x16 F32+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x136x16_F32F16F16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[68];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %73, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n136k16.f32.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67},"
+      "{%68, %69, %70, %71},"
+      " %72,"
+      " p, %74, %75, %76;\n"
+    "}\n"
+    : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+      "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+      "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+      "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+      "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+      "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+      "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+      "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+      "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+      "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+      "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+      "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+      "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+      "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+      "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+      "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+      "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67)
+    : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+      "l"(desc_b),
+      "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x144x16 F32+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x144x16_F32F16F16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[72];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+
float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p, %75, %76, %77, %78;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & 
d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p, %78, %79, %80;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & 
d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %78, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " p, %79, %80, %81, %82;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + 
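// Each of the 128 threads in the warpgroup owns CRegisters = float[76] of the
+      // 64x152 f32 accumulator tile (64*152/128 = 76), updated in place via these
+      // references:
+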
float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %81, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " p, %82, %83, %84;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & 
d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p, %83, %84, %85, %86;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t 
const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p, %86, %87, %88;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; 
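+  // In the SS ("smem x smem") form, ARegisters/BRegisters are 64-bit shared
+  // memory matrix descriptors, and DRegisters is void because D aliases the
+  // accumulator fragment in CRegisters. Illustrative sketch only (wrapper
+  // spelling assumed, not part of this diff): such atoms are normally consumed
+  // through cute::MMA_Atom, e.g.
+  //   using Atom = cute::MMA_Atom<MMA_64x168x16_F32F16F16_SS<GMMA::Major::K, GMMA::Major::K>>;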
+ using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %86, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " p, %87, %88, %89, %90;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA 
= GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %89, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " p, %90, %91, %92;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), 
"n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p, %91, %92, %93, %94;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), 
"+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p, %94, %95, %96;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), 
+ "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %94, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " 
+ " %88, %89, %90, %91}," + " %92," + " %93," + " p, %95, %96, %97, %98;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + 
".reg .pred p;\n" + "setp.ne.b32 p, %97, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " p, %98, %99, %100;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, 
float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %102, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " p, %103, %104, %105, %106;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& 
a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %105, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " p, %106, %107, %108;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), 
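+ // The predicate idiom above ("setp.ne.b32 p, %105, 0") converts the runtime
+ // scale_D argument into the wgmma scale-d predicate: p==1 computes
+ // D = A*B + D, while p==0 (GMMA::ScaleOut::Zero) drops the accumulator input
+ // and computes D = A*B, which is typically how the first MMA of a tile
+ // initializes the accumulators without a separate clear.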
"+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p, %107, %108, %109, %110;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), 
"+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + 
float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p, %110, %111, %112;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & 
d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %110, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " p, %111, %112, %113, %114;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), 
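+ // Operand constraints: scale_D is the only runtime flag ("r"); scaleA, scaleB,
+ // tnspA, and tnspB are template parameters lowered to PTX immediates via the
+ // "n" constraint, so input negation and transposition are fixed at compile
+ // time. Note that the RS atoms pass no tnspA immediate at all, since their A
+ // operand is already register-resident and K-major.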
"+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %113, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, 
%109, %110, %111}," + " %112," + " p, %114, %115, %116;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, 
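+ // desc_a / desc_b are 64-bit shared-memory matrix descriptors ("l"
+ // constraint) encoding the smem start address, leading-dimension and stride
+ // byte offsets, and the swizzle mode. They are constructed outside these
+ // atoms (in CuTe's GmmaDescriptor machinery, see mma_sm90_desc.hpp); the
+ // structs here only forward them to the instruction.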
float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p, %115, %116, %117, %118;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major 
layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p, %118, %119, %120;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), 
"+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %118, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, 
%15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " p, %119, %120, %121, %122;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, 
float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %121, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " p, %122, %123, %124;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), 
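+ // The cutlass::arch::synclog_emit_wgmma_* calls at the top of each fma are
+ // hooks for CUTLASS's synclog debugging tool: they record the wgmma issue
+ // site and its smem descriptor(s) to the synchronization log, and compile
+ // away to nothing when synclog is not enabled at build time.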
"+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " 
+ " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p, %123, %124, %125, %126;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, 
float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p, %126, %127, %128;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), 
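+ // Illustrative usage sketch (an assumption for orientation, not part of this
+ // diff: cute namespace, CuTe's warpgroup helpers, K-major x K-major operands):
+ //
+ //   using Atom = MMA_64x240x16_F32F16F16_SS<GMMA::Major::K, GMMA::Major::K>;
+ //   warpgroup_arrive();                      // wgmma.fence before issuing
+ //   Atom::fma(desc_a, desc_b, d000, ..., d119, GMMA::ScaleOut::One);
+ //   warpgroup_commit_batch();                // wgmma.commit_group
+ //   warpgroup_wait<0>();                     // wgmma.wait_group 0
+ //
+ // In practice these atoms are not called directly; cute::make_tiled_mma wraps
+ // them in an MMA_Atom and cute::gemm drives the fragment loop.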
"r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x16_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %126, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, 
%113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " p, %127, %128, %129, %130;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x16_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x16 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x16_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & 
d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %129, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k16.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " p, %130, %131, %132;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), 
"+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x16_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16, %17, %18;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19, %20;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x16_F32BF16BF16_RS without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24, %25, %26;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27, %28;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), 
"r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28, %29, %30;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, 
%14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31, %32;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32, %33, %34;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, 
float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35, %36;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40, %41, %42;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + 
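+  // Fallback branch, taken when the translation unit is not compiled for
+  // sm_90a: CUTE_INVALID_CONTROL_PATH fails loudly with the message below,
+  // so instantiating this atom on an unsupported target cannot silently
+  // compile to a no-op.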
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43, %44;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + 
float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44, %45, %46;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47, %48;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), 
"+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48, %49, %50;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + 
CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51, %52;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, 
desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56, %57, %58;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59, %60;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), 
"+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60, %61, %62;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63, %64;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x16_F32BF16BF16_SS +{ + using DRegisters = void; + 
using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64, %65, %66;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, 
float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67, %68;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & 
d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %70, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " p, %71, %72, %73, %74;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float 
& d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %73, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " p, %74, %75, %76;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
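+  // Constraint classes in the asm below: accumulators are bound "+f"
+  // (read-modify-write FP32 registers), the two shared-memory descriptors are
+  // "l" (64-bit), scale_D is "r" (a runtime 32-bit value), and the remaining
+  // scale/transpose template parameters are folded in as "n" immediates.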
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p, %75, %76, %77, %78;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p, %78, %79, %80;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + 
".reg .pred p;\n" + "setp.ne.b32 p, %78, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " p, %79, %80, %81, %82;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %81, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " p, %82, %83, %84;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p, %83, %84, %85, %86;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + 
float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p, %86, %87, %88;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, 
float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %86, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " p, %87, %88, %89, %90;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & 
d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %89, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " p, %90, %91, %92;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float 
& d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p, %91, %92, %93, %94;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using 
BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p, %94, %95, %96;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %94, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " p, %95, %96, %97, %98;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), 
"+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %97, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " p, %98, %99, %100;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), 
"+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %102, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " 
+ " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " p, %103, %104, %105, %106;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & 
d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %105, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " p, %106, %107, %108;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x16_F32BF16BF16_SS +{ + using 
DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p, %107, %108, %109, %110;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), 
"+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, 
%57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p, %110, %111, %112;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & 
d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %110, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " p, %111, %112, %113, %114;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using 
CRegisters = float[108]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %113, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " p, %114, %115, %116;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + 
"+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, 
%26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p, %115, %116, %117, %118;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, 
float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p, %118, %119, %120;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), 
"n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %118, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " p, %119, %120, %121, %122;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), 
"+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, 
float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %121, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " p, %122, %123, %124;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One +> +struct MMA_64x240x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p, %123, %124, %125, %126;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), 
"+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float 
& d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p, %126, %127, %128;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x16_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE 
static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %126, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " p, %127, %128, %129, %130;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), 
"+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x16_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x16 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x16_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & 
d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %129, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k16.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " p, %130, %131, %132;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x16_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters 
= float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + 
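// A sketch of the operand mapping this generated pattern uses (annotation +
// only, no behavioral change): %0-%19 are the "+f" accumulators -- each +
// thread of the 128-thread warpgroup holds N/2 = 20 floats of the 64x40 +
// f32 tile; %20 and %21 are the 64-bit shared-memory matrix descriptors +
// for A and B; predicate p, derived from scale_D, drops the old accumulator +
// contribution when scale_D == Zero; scaleA and scaleB are baked in as "n" +
// immediates. No tnspA/tnspB operands appear because this TN atom fixes +
// both operands K-major. +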
asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + 
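// Usage note (an assumption from the wgmma.mma_async pattern, not part of +
// this diff): the MMA executes asynchronously, so callers are expected to +
// order fma() with the warpgroup fences (cute::warpgroup_arrive(), +
// cute::warpgroup_commit_batch(), cute::warpgroup_wait<N>()) before the +
// d00..d23 accumulators are read back. +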
asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, 
float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = 
uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), 
"+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + 
float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), 
"+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + 
using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & 
d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), 
+ "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + 
fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & 
d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %70, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " p, %71, %72;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & 
d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %73, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " p, %74, %75;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
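+    // Scale-operand note (a hedged reading of the PTX, not normative docs):
+    // "setp.ne.b32 p, %74, 0" materializes the runtime scale_D argument as
+    // predicate p, so GMMA::ScaleOut::One appears to request D = A*B + D
+    // while GMMA::ScaleOut::Zero discards the previous accumulator
+    // (D = A*B). The trailing "n" immediates bind the scaleA/scaleB template
+    // parameters (GMMA::ScaleIn, One by default), which select the sign
+    // applied to the A and B inputs.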
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p, %75, %76;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k8.f32.tf32.tf32 
" + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p, %78, %79;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %78, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, 
%28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " p, %79, %80;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %81, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " 
%40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " p, %82, %83;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, 
%45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p, %83, %84;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, 
%37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p, %86, %87;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %86, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, 
%21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " p, %87, %88;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %89, 
0;\n" + "wgmma.mma_async.sync.aligned.m64n168k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " p, %90, %91;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const 
scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p, %91, %92;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float 
& d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p, %94, %95;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, 
float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %94, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " p, %95, %96;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float 
& d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %97, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " p, %98, %99;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x8_F32TF32TF32_SS_TN +{ + using DRegisters = 
void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %102, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " p, %103, %104;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), 
"+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %105, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " p, %106, %107;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), 
"+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p, %107, %108;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, 
float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p, %110, %111;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
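Every atom in this family funnels through the same operand pattern: `setp.ne.b32 p, %N, 0` turns the runtime `scale_D` argument into the accumulate-vs-overwrite predicate on the `wgmma.mma_async` instruction, while `scaleA`/`scaleB` are compile-time `"n"` immediates (±1) that can negate an input operand. As a minimal scalar model of what one accumulator lane computes (illustration only; the real instruction produces a 64xNx8 tile cooperatively across a 128-thread warpgroup, and the helper name below is hypothetical):

```cpp
#include <cstdint>

// Hypothetical scalar model of one accumulator lane of wgmma.mma_async.
// scale_D: GMMA::ScaleOut (0 = overwrite D, 1 = accumulate into D),
//          mirrored by "setp.ne.b32 p, %N, 0" in the asm above.
// scaleA/scaleB: GMMA::ScaleIn immediates (+1 or -1) negating an operand.
inline void wgmma_lane_model(float& d, float a, float b,
                             int32_t scale_D, int32_t scaleA, int32_t scaleB)
{
  float prod = (float(scaleA) * a) * (float(scaleB) * b);
  d = (scale_D != 0) ? d + prod : prod;  // p ? D += A*B : D = A*B
}
```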
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %110, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " p, %111, %112;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + 
"+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %113, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, 
%7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " p, %114, %115;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & 
d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p, %115, %116;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
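For context on how one of these atoms would be consumed (a usage sketch, not part of this patch): the `SS` variants read both operands from shared memory through 64-bit `GmmaDescriptor`s, while the `RS` variants take the A fragment from registers (`uint32_t[4]` per thread). Assuming the companion `MMA_Traits` specializations exist for these shapes, as they do for the pre-existing SM90 GMMA atoms, a device-side MMA step might look like the following; the function names are CuTe's existing SM90 GMMA support, and the kernel plumbing and tensor partitioning are omitted.

```cpp
#include <cute/tensor.hpp>

// Hedged sketch: tCsA/tCsB/tCrC are assumed to be tensors already
// partitioned by this TiledMMA; only the MMA issue/retire sequence is shown.
template <class TA, class TB, class TC>
__device__ void mma_step(TA const& tCsA, TB const& tCsB, TC& tCrC)
{
  using namespace cute;
  auto mma = make_tiled_mma(MMA_64x224x8_F32TF32TF32_SS_TN<>{});
  warpgroup_fence_operand(tCrC);  // fence the register accumulators
  warpgroup_arrive();             // order prior accesses (wgmma.fence)
  gemm(mma, tCsA, tCsB, tCrC);    // C += A * B  (TF32 in, F32 accumulate)
  warpgroup_commit_batch();
  warpgroup_wait<0>();            // block until the committed batch retires
  warpgroup_fence_operand(tCrC);
}
```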
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p, %118, %119;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), 
"+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %118, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " p, %119, %120;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + 
float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %121, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " p, %122, %123;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), 
"+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, 
%71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p, %123, %124;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, 
float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p, %126, %127;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); 
+#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x8_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %126, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " p, %127, %128;\n" + "}\n" + : "+f"(d000), 
"+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x8_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x8 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x8_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, 
float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %129, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k8.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " p, %130, %131;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x8_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
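////////////////////////////////////////////////////////////////////////////////////////////////////

// [Editor's note] Shared anatomy of the atoms above and below: DRegisters is void because
// wgmma accumulates in place into the C registers; desc_a/desc_b are 64-bit shared-memory
// matrix descriptors; and scale_D drives the predicate p (setp.ne.b32), so ScaleOut::Zero
// overwrites the accumulator (D = A*B) while the default ScaleOut::One accumulates
// (D += A*B). The integer atoms that follow additionally drop the ScaleIn template
// parameters (wgmma has no input negation for s8) and gain _SATURATE variants that emit
// the .satfinite qualifier to clamp results into the s32 range. Every fma must be bracketed
// by the warpgroup fence protocol. A minimal device-side sketch, assuming the descriptors
// were built elsewhere (e.g. via cute::GMMA::make_gmma_desc) and with namespace
// qualification of the atom elided; illustrative only, not an authoritative usage:
//
//   __device__ void s8_mma_step(uint64_t desc_a, uint64_t desc_b, uint32_t (&acc)[12]) {
//     using namespace cute;
//     warpgroup_arrive();                      // fence operands before issuing wgmma
//     MMA_64x24x32_S32S8S8_SS_TN::fma(         // the m64n24k32 s32+=s8*s8 atom below
//         desc_a, desc_b,
//         acc[0], acc[1], acc[2], acc[3], acc[4],  acc[5],
//         acc[6], acc[7], acc[8], acc[9], acc[10], acc[11],
//         GMMA::ScaleOut::One);                // accumulate: D += A*B
//     warpgroup_commit_batch();                // commit the async wgmma group
//     warpgroup_wait<0>();                     // wait for all committed groups
//   }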
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=S8*S8 +struct MMA_64x24x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=S8*S8 +struct MMA_64x24x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=S8*S8 +struct MMA_64x48x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=S8*S8 +struct MMA_64x48x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=S8*S8 +struct MMA_64x80x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + 
uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=S8*S8 +struct MMA_64x80x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
MMA_64x80x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=S8*S8 +struct MMA_64x112x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=S8*S8 +struct MMA_64x112x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & 
d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=S8*S8 +struct MMA_64x144x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t 
& d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=S8*S8 +struct MMA_64x144x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, 
uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=S8*S8 +struct MMA_64x160x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut 
const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=S8*S8 +struct MMA_64x160x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + 
uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=S8*S8 +struct MMA_64x176x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & 
d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=S8*S8 +struct MMA_64x176x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & 
d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=S8*S8 +struct MMA_64x208x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, 
uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + 
"+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=S8*S8 +struct MMA_64x208x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), 
"+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=S8*S8 +struct MMA_64x224x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, 
uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=S8*S8 +struct MMA_64x224x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, 
uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), 
"+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=S8*S8 +struct MMA_64x240x32_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " 
+ " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=S8*S8 +struct MMA_64x240x32_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, 
+ uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), 
"+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=S8*S8 +struct MMA_64x24x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=S8*S8 +struct MMA_64x24x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=S8*S8 +struct MMA_64x48x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters 
= uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=S8*S8 +struct MMA_64x48x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// 
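+
+// Usage sketch (illustrative only, not part of the generated atom set): these
+// GMMA wrappers are normally consumed through CuTe's MMA_Atom / TiledMMA
+// machinery rather than by calling fma() with dozens of register references
+// by hand. Assuming the cute::SM90::GMMA namespace used for these atoms, a
+// minimal sketch looks like:
+//
+//   using Atom = cute::MMA_Atom<cute::SM90::GMMA::MMA_64x48x32_S32S8S8_RS_TN>;
+//   auto tiled_mma = cute::make_tiled_mma(Atom{});  // one warpgroup-wide MMA
+//   cute::gemm(tiled_mma, tCrA, tCrB, tCrC);        // accumulates S32 += S8*S8
+//
+// Passing scale_D = GMMA::ScaleOut::Zero on the first k-block clears the
+// accumulator instead of accumulating into it; the predicate p computed in
+// the asm above maps this flag onto the instruction's scale-d operand.
+
+////////////////////////////////////////////////////////////////////////////////////////////////////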
+ +// GMMA 64x80x32 TN S32+=S8*S8 +struct MMA_64x80x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=S8*S8 +struct MMA_64x80x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm 
volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=S8*S8 +struct MMA_64x112x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + 
"+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=S8*S8 +struct MMA_64x112x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=S8*S8 +struct 
MMA_64x144x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=S8*S8 +struct MMA_64x144x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using 
BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=S8*S8 +struct MMA_64x160x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static 
void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=S8*S8 +struct 
MMA_64x160x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32S8S8_RS_TN_SATURATE without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=S8*S8 +struct MMA_64x176x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), 
"+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=S8*S8 +struct MMA_64x176x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), 
"+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=S8*S8 +struct MMA_64x208x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, 
%5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=S8*S8 +struct MMA_64x208x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, 
uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=S8*S8 +struct 
MMA_64x224x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), 
"+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=S8*S8 +struct MMA_64x224x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, 
uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=S8*S8 +struct MMA_64x240x32_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t 
& d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), 
"+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=S8*S8 +struct MMA_64x240x32_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & 
d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=S8*U8 +struct MMA_64x24x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=S8*U8 +struct MMA_64x24x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=S8*U8 +struct MMA_64x48x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + 
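// Operand numbering in these generated wrappers is positional: the accumulator + // fragments come first, then the A source (four "r" registers for RS atoms, one "l" + // shared-memory descriptor for SS atoms), then the B descriptor, and finally the + // scale-D predicate; with 24 accumulators for this 64x48 tile, desc_a is %24, desc_b + // is %25, and scale_D is the %26 tested by the setp.ne.b32 above. +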
"+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=S8*U8 +struct MMA_64x48x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=S8*U8 +struct MMA_64x80x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, 
%37, %38, %39}," + " %40," + " %41," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=S8*U8 +struct MMA_64x80x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=S8*U8 +struct MMA_64x112x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, 
uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=S8*U8 +struct MMA_64x112x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, 
uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=S8*U8 +struct MMA_64x144x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, 
%21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=S8*U8 +struct MMA_64x144x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, 
%53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=S8*U8 +struct MMA_64x160x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, 
" + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=S8*U8 +struct MMA_64x160x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " 
%48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=S8*U8 +struct MMA_64x176x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, 
%12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=S8*U8 +struct MMA_64x176x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + 
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=S8*U8 +struct MMA_64x208x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & 
d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=S8*U8 +struct MMA_64x208x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& 
desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), 
"+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=S8*U8 +struct MMA_64x224x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, 
%31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=S8*U8 +struct MMA_64x224x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, 
uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
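+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// Usage sketch (editor's illustration, not one of the generated atoms): every
+// MMA_64xNx32_S32S8U8_{SS,RS}_TN[_SATURATE] struct in this file wraps exactly one
+// wgmma.mma_async PTX instruction; the N/2 per-thread s32 accumulator registers, the two
+// shared-memory matrix descriptors (or the four A registers for the RS variants), and the
+// scale-D predicate map one-to-one onto the asm operands above. These atoms are normally
+// consumed through CuTe's MMA_Traits/TiledMMA machinery rather than called directly, but a
+// direct call is possible. A minimal sketch, assuming a full warpgroup (128 threads),
+// descriptors desc_a/desc_b already built for canonical GMMA shared-memory layouts, and the
+// warpgroup_arrive/commit/wait helpers from this header family; the function name and the
+// SM90::GMMA namespace qualification are illustrative assumptions:
+//
+//   __device__ void gmma_s8u8_n24_step(uint64_t desc_a, uint64_t desc_b, uint32_t (&acc)[12])
+//   {
+//     cute::warpgroup_arrive();                        // wgmma.fence before issuing the MMA
+//     SM90::GMMA::MMA_64x24x32_S32S8U8_SS_TN::fma(
+//         desc_a, desc_b,
+//         acc[0], acc[1], acc[2], acc[3], acc[4],  acc[5],
+//         acc[6], acc[7], acc[8], acc[9], acc[10], acc[11],
+//         GMMA::ScaleOut::One);                        // One: D = A*B + D; Zero: D = A*B
+//     cute::warpgroup_commit_batch();                  // wgmma.commit_group
+//     cute::warpgroup_wait<0>();                       // block until the committed batch retires
+//   }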
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=S8*U8 +struct MMA_64x240x32_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p;\n" + "}\n" + : "+r"(d000), 
"+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=S8*U8 +struct MMA_64x240x32_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & 
d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=S8*U8 +struct 
MMA_64x24x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=S8*U8 +struct MMA_64x24x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=S8*U8 +struct MMA_64x48x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & 
d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=S8*U8 +struct MMA_64x48x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=S8*U8 +struct MMA_64x80x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t 
& d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=S8*U8 +struct MMA_64x80x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), 
"+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=S8*U8 +struct MMA_64x112x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=S8*U8 +struct 
MMA_64x112x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=S8*U8 +struct MMA_64x144x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & 
d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=S8*U8 +struct MMA_64x144x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & 
d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=S8*U8 +struct MMA_64x160x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, 
uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=S8*U8 +struct MMA_64x160x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & 
d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=S8*U8 +struct MMA_64x176x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & 
d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=S8*U8 +struct 
MMA_64x176x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + 
"+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=S8*U8 +struct MMA_64x208x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p;\n" + "}\n" + : 
"+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=S8*U8 +struct MMA_64x208x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, 
uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=S8*U8 +struct MMA_64x224x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t 
& d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), 
"+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=S8*U8 +struct MMA_64x224x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, 
%23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=S8*U8 +struct MMA_64x240x32_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & 
d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), 
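// (Sizing note) Every atom in this family distributes its 64xN s32
// accumulator tile over the 128 threads of one warpgroup, so each thread
// owns 64*N/128 = N/2 32-bit registers -- hence CRegisters = uint32_t[112]
// for N=224 and uint32_t[120] for N=240 here. A compile-time sketch of that
// relation (illustrative only; gmma_acc_regs is not a CUTLASS symbol):
//
//   constexpr int gmma_acc_regs(int n) { return 64 * n / 128; }
//   static_assert(gmma_acc_regs(224) == 112, "CRegisters for m64n224");
//   static_assert(gmma_acc_regs(240) == 120, "CRegisters for m64n240");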
"+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=S8*U8 +struct MMA_64x240x32_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " 
+ " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=U8*S8 +struct MMA_64x24x32_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32U8S8_SS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=U8*S8 +struct MMA_64x24x32_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=U8*S8 +struct MMA_64x48x32_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=U8*S8 +struct MMA_64x48x32_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, 
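// (Usage note) The predicate p in these asm blocks is set from scale_D:
// ScaleOut::Zero issues D = A*B, ScaleOut::One issues D += A*B, so a
// mainloop typically runs the first wgmma of an accumulation chain with
// Zero and the rest with One. Hedged sketch against the 64x24x32 atom
// defined above (assumes this header's enclosing namespace; warpgroup
// fence/commit/wait and descriptor advancement are elided, and advance_k
// is a hypothetical helper):
//
//   __device__ void mma_chain_sketch(uint64_t desc_a, uint64_t desc_b,
//                                    uint32_t (&acc)[12], int k_tiles) {
//     for (int k = 0; k < k_tiles; ++k) {
//       auto scale = (k == 0) ? GMMA::ScaleOut::Zero : GMMA::ScaleOut::One;
//       MMA_64x24x32_S32U8S8_SS_TN::fma(desc_a, desc_b,
//           acc[0], acc[1], acc[2], acc[3], acc[4],  acc[5],
//           acc[6], acc[7], acc[8], acc[9], acc[10], acc[11], scale);
//       // desc_a = advance_k(desc_a); desc_b = advance_k(desc_b);
//     }
//   }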
uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=U8*S8 +struct MMA_64x80x32_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32U8S8_SS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=U8*S8 +struct MMA_64x80x32_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=U8*S8 +struct MMA_64x112x32_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, 
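// (Operand indexing) The setp source index in each asm block follows from
// the operand order: accumulators first, then the A operand (one 64-bit
// smem descriptor for _SS_ atoms, four 32-bit registers for _RS_ atoms),
// then desc_b, then scale_D. The scale operand therefore lands at
// CRegisters + 2 for _SS_ and CRegisters + 5 for _RS_, e.g. %58 in the
// surrounding m64n112 atom and %117 in the m64n224 RS atom earlier.
// Sketch (scale_operand_index is illustrative, not a CUTLASS symbol):
//
//   constexpr int scale_operand_index(int acc_regs, bool reg_a) {
//     return acc_regs + (reg_a ? 4 : 1) + 1;  // A operand(s), then desc_b
//   }
//   static_assert(scale_operand_index(56,  false) == 58,  "m64n112 SS");
//   static_assert(scale_operand_index(112, true)  == 117, "m64n224 RS");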
uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=U8*S8 +struct MMA_64x112x32_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, 
%55}," + " %56," + " %57," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=U8*S8 +struct MMA_64x144x32_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + 
"+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=U8*S8 +struct MMA_64x144x32_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + 
"+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=U8*S8 +struct MMA_64x160x32_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), 
"+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=U8*S8 +struct MMA_64x160x32_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), 
"+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=U8*S8 +struct MMA_64x176x32_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), 
"+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=U8*S8 +struct MMA_64x176x32_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, 
%63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=U8*S8 +struct MMA_64x208x32_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + 
uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=U8*S8 +struct MMA_64x208x32_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, 
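// (Saturation note) Each _SATURATE atom differs from its base atom only in
// the ".satfinite" qualifier, which clamps the s32 accumulator to the
// destination range on overflow instead of wrapping. A self-contained way
// to pick between the two m64n208 forms at compile time (illustrative
// helper, not part of this change):
//
//   template <bool Saturate> struct select_m64n208_u8s8
//   { using type = MMA_64x208x32_S32U8S8_SS_TN; };
//   template <> struct select_m64n208_u8s8<true>
//   { using type = MMA_64x208x32_S32U8S8_SS_TN_SATURATE; };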
uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
MMA_64x208x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x224x32 TN S32+=U8*S8
+struct MMA_64x224x32_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[112];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %114, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111},"
+      " %112,"
+      " %113,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x224x32 TN S32+=U8*S8
+struct MMA_64x224x32_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[112];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %114, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111},"
+      " %112,"
+      " %113,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
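+
+// Note: each warpgroup-wide m64nNk32 GMMA spreads its 64xN s32 accumulator
+// across the 128 threads of the warpgroup, i.e. 64*N/128 = N/2 registers per
+// thread; hence uint32_t[112] for the 64x224 atoms above and uint32_t[120]
+// for the 64x240 atoms below.
+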
+// GMMA 64x240x32 TN S32+=U8*S8
+struct MMA_64x240x32_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[120];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %122, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111, "
+      " %112, %113, %114, %115, %116, %117, %118, %119},"
+      " %120,"
+      " %121,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x240x32 TN S32+=U8*S8
+struct MMA_64x240x32_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[120];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %122, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111, "
+      " %112, %113, %114, %115, %116, %117, %118, %119},"
+      " %120,"
+      " %121,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
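+
+// Note: the _SS_TN atoms above read both A and B from shared memory through
+// 64-bit GMMA matrix descriptors (desc_a/desc_b). The _RS_TN atoms that
+// follow source the A fragment from registers instead: 64x32 u8 values
+// spread over 128 threads is 16 bytes per thread, i.e. ARegisters =
+// uint32_t[4], while B still arrives via desc_b.
+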
+// GMMA 64x24x32 TN S32+=U8*S8
+struct MMA_64x24x32_S32U8S8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[12];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %17, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n24k32.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11},"
+      "{%12, %13, %14, %15},"
+      " %16,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x24x32 TN S32+=U8*S8
+struct MMA_64x24x32_S32U8S8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[12];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %17, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n24k32.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11},"
+      "{%12, %13, %14, %15},"
+      " %16,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x48x32 TN S32+=U8*S8
+struct MMA_64x48x32_S32U8S8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[24];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %29, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23},"
+      "{%24, %25, %26, %27},"
+      " %28,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x48x32 TN S32+=U8*S8
+struct MMA_64x48x32_S32U8S8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[24];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %29, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23},"
+      "{%24, %25, %26, %27},"
+      " %28,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
"wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=U8*S8 +struct MMA_64x80x32_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=U8*S8 +struct 
+// GMMA 64x112x32 TN S32+=U8*S8
+struct MMA_64x112x32_S32U8S8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[56];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %61, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55},"
+      "{%56, %57, %58, %59},"
+      " %60,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x112x32 TN S32+=U8*S8
+struct MMA_64x112x32_S32U8S8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[56];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %61, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55},"
+      "{%56, %57, %58, %59},"
+      " %60,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
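+
+// Illustrative sketch only (placeholder tensor names, assuming the usual
+// CuTe workflow): these atoms are not meant to be invoked directly; they are
+// wrapped in an MMA_Atom and driven by cute::gemm, e.g.
+//
+//   auto mma = make_tiled_mma(MMA_64x112x32_S32U8S8_RS_TN{});
+//   gemm(mma, tCrA, tCrB, tCrC);  // eventually dispatches to fma() above
+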
+// GMMA 64x144x32 TN S32+=U8*S8
+struct MMA_64x144x32_S32U8S8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[72];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %77, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71},"
+      "{%72, %73, %74, %75},"
+      " %76,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x144x32 TN S32+=U8*S8
+struct MMA_64x144x32_S32U8S8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[72];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %77, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71},"
+      "{%72, %73, %74, %75},"
+      " %76,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
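+
+// Note: desc_b (and desc_a in the _SS_ atoms) is a packed 64-bit GMMA
+// shared-memory matrix descriptor encoding the tile's smem start address,
+// leading-dimension and stride byte offsets, and swizzle mode.
+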
+// GMMA 64x160x32 TN S32+=U8*S8
+struct MMA_64x160x32_S32U8S8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[80];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %85, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79},"
+      "{%80, %81, %82, %83},"
+      " %84,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x160x32 TN S32+=U8*S8
+struct MMA_64x160x32_S32U8S8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[80];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %85, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79},"
+      "{%80, %81, %82, %83},"
+      " %84,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
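+
+// Note: every fma() body is guarded by CUTE_ARCH_MMA_SM90A_ENABLED, which is
+// only set when compiling for the sm_90a target (e.g. nvcc -arch=sm_90a);
+// otherwise the call collapses into the CUTE_INVALID_CONTROL_PATH trap.
+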
+// GMMA 64x176x32 TN S32+=U8*S8
+struct MMA_64x176x32_S32U8S8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[88];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %93, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87},"
+      "{%88, %89, %90, %91},"
+      " %92,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x176x32 TN S32+=U8*S8
+struct MMA_64x176x32_S32U8S8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[88];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %93, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87},"
+      "{%88, %89, %90, %91},"
+      " %92,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
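+
+// Note: the cutlass::arch::synclog_emit_wgmma_* calls record each wgmma issue
+// (tagged with __LINE__ and the operand descriptors) for CUTLASS's synclog
+// debugging facility, and are expected to compile away when synclog is
+// disabled.
+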
+// GMMA 64x208x32 TN S32+=U8*S8
+struct MMA_64x208x32_S32U8S8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[104];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %109, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103},"
+      "{%104, %105, %106, %107},"
+      " %108,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x208x32 TN S32+=U8*S8
+struct MMA_64x208x32_S32U8S8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[104];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %109, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103},"
+      "{%104, %105, %106, %107},"
+      " %108,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x224x32 TN S32+=U8*S8
+struct MMA_64x224x32_S32U8S8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[112];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056,
uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32U8S8_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=U8*S8 +struct MMA_64x224x32_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), 
"+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=U8*S8 +struct MMA_64x240x32_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, 
uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=U8*S8 +struct MMA_64x240x32_S32U8S8_RS_TN_SATURATE +{ + using 
DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), 
"+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=U8*U8 +struct MMA_64x24x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=U8*U8 +struct MMA_64x24x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + 
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=U8*U8 +struct MMA_64x48x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=U8*U8 +struct MMA_64x48x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=U8*U8 +struct MMA_64x80x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=U8*U8 +struct MMA_64x80x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t 
& d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=U8*U8 +struct MMA_64x112x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, 
%39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=U8*U8 +struct MMA_64x112x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), 
"+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=U8*U8 +struct MMA_64x144x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
MMA_64x144x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=U8*U8 +struct MMA_64x144x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
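////////////////////////////////////////////////////////////////////////////////////////////////////

// Editor's sketch (annotation only; not an added line of this diff): every atom above follows
// one pattern, so its invariants are worth stating once. A warpgroup is 128 threads, and an
// m64nNk32 S32 accumulator tile holds 64*N int32 values, so each thread carries
// 64*N/128 = N/2 uint32_t accumulators -- exactly the CRegisters extent each struct declares
// (uint32_t[72] for the n144 atoms just above). The "setp.ne.b32 p, %k, 0" preamble converts
// the trailing scale_D operand into predicate p: GMMA::ScaleOut::One accumulates
// D = A*B + D, while GMMA::ScaleOut::Zero overwrites D = A*B. RS variants read A from four
// 32-bit registers and B through a 64-bit shared-memory descriptor (the "l" constraint);
// SS variants pass both operands as descriptors. A minimal compile-time check of the sizing
// rule, using only shapes that appear in this file:

static_assert(64 * 144 / 128 ==  72, "m64n144k32 S32 accumulator: 72 regs per thread");
static_assert(64 * 208 / 128 == 104, "m64n208k32 S32 accumulator: 104 regs per thread");
static_assert(64 * 240 / 128 == 120, "m64n240k32 S32 accumulator: 120 regs per thread");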
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=U8*U8 +struct MMA_64x160x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
MMA_64x160x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=U8*U8 +struct MMA_64x160x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + 
"l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=U8*U8 +struct MMA_64x176x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), 
"+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=U8*U8 +struct MMA_64x176x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), 
+ "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=U8*U8 +struct MMA_64x208x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, 
%26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=U8*U8 +struct MMA_64x208x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, 
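+ // CuTe's MMA atom machinery unpacks the fragment arrays and passes all
+ // 104 accumulator registers to fma() by reference, as listed here.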
uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=U8*U8 +struct MMA_64x224x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, 
uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), 
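+ // desc_a and desc_b (%112 and %113 above) are 64-bit shared-memory matrix
+ // descriptors, hence the "l" input constraints below.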
"+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=U8*U8 +struct MMA_64x224x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 
p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=U8*U8 +struct MMA_64x240x32_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & 
d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), 
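+ // scale_D (%122 above) is materialized as predicate p: when it is
+ // ScaleOut::Zero, the instruction writes D = A*B without accumulating.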
"+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=U8*U8 +struct MMA_64x240x32_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, 
%41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=U8*U8 +struct MMA_64x24x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
MMA_64x24x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN S32+=U8*U8 +struct MMA_64x24x32_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=U8*U8 +struct MMA_64x48x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN S32+=U8*U8 +struct MMA_64x48x32_S32U8U8_RS_TN_SATURATE +{ + using 
DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=U8*U8 +struct MMA_64x80x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), 
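+ // RS variant: operand A comes from registers a00..a03 rather than a
+ // shared-memory descriptor; only B is described by desc_b.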
"+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN S32+=U8*U8 +struct MMA_64x80x32_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=U8*U8 +struct MMA_64x112x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, 
uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN S32+=U8*U8 +struct MMA_64x112x32_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & 
d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=U8*U8 +struct MMA_64x144x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.u8 " + "{%0, %1, %2, %3, 
%4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN S32+=U8*U8 +struct MMA_64x144x32_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, 
%20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=U8*U8 +struct MMA_64x160x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, 
%14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN S32+=U8*U8 +struct MMA_64x160x32_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + 
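+      // The same predicate idiom appears in every atom in this file: `p` is
+      // set from the runtime scale_D operand (GMMA::ScaleOut), and the wgmma
+      // instruction's trailing `p` selects D = A*B + D (scale_D != 0) versus
+      // D = A*B with the prior accumulator contents ignored (scale_D == 0).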
".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=U8*U8 +struct MMA_64x176x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + 
uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN S32+=U8*U8 +struct MMA_64x176x32_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, 
uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=U8*U8 +struct MMA_64x208x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, 
+ uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + 
"+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN S32+=U8*U8 +struct MMA_64x208x32_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " 
p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=U8*U8 +struct MMA_64x224x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + 
uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN S32+=U8*U8 +struct MMA_64x224x32_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, 
uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), 
"+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=U8*U8 +struct MMA_64x240x32_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, 
uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN S32+=U8*U8 +struct MMA_64x240x32_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, 
uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + 
"+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5}," + " %6," + " %7," + " p, %9, %10;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5}," + "{%6, %7, %8, %9}," + " %10," + " p, %12, %13;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(int32_t(scale_D)), 
"n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + 
using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " p, %13, %14;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " p, %16, %17;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, 
desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %22, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n40k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19},"
+      " %20,"
+      " %21,"
+      " p, %23, %24;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x40x32 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x40x32_F32E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[20];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %25, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n40k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19},"
+      "{%20, %21, %22, %23},"
+      " %24,"
+      " p, %26, %27;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x48x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x48x32_F16E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[12];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %14, 0;\n"
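+      // The two trailing immediate operands on the FP8 atoms (%15 and %16
+      // here) are the compile-time scaleA / scaleB template parameters
+      // (GMMA::ScaleIn), lowered through the "n" constraint; they select an
+      // optional negation of the A / B inputs. The integer U8 atoms above
+      // carry no such operands.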
"wgmma.mma_async.sync.aligned.m64n48k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + 
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " p, %17, %18;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), 
+ "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " p, %20, %21;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + 
"+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n72k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " p, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " p, %24, %25;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F16+=E4M3*E4M3 +template < 
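+// Worked operand map for the n72 FP32 SS atom above: %0-%35 are the 36
+// accumulator registers, %36/%37 the A/B shared-memory descriptors, %38 the
+// runtime scale_D value consumed by setp, and %39/%40 the scaleA/scaleB
+// immediates -- the same ordering every atom in this file follows.
+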
+// GMMA 64x80x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x80x32_F16E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[20];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %22, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n80k32.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19},"
+      " %20,"
+      " %21,"
+      " p, %23, %24;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x80x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x80x32_F16E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[20];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %25, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n80k32.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19},"
+      "{%20, %21, %22, %23},"
+      " %24,"
+      " p, %26, %27;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x80x32 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x80x32_F32E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[40];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %42, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39},"
+      " %40,"
+      " %41,"
+      " p, %43, %44;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
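+// The #else branch guards misuse: CUTE_ARCH_MMA_SM90A_ENABLED is only
+// defined when compiling for the sm_90a target (plain sm_90 does not expose
+// wgmma), so instantiating one of these atoms elsewhere trips
+// CUTE_INVALID_CONTROL_PATH instead of emitting unsupported PTX.
+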
+// GMMA 64x80x32 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x80x32_F32E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[40];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %45, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39},"
+      "{%40, %41, %42, %43},"
+      " %44,"
+      " p, %46, %47;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x88x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x88x32_F16E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[22];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %24, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n88k32.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21},"
+      " %22,"
+      " %23,"
+      " p, %25, %26;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
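+// synclog_emit_wgmma_smem_smem / _reg_smem record each wgmma issue site and
+// its descriptor operand(s) for the synclog debugging tool; when synclog is
+// disabled these calls are expected to compile away to nothing.
+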
const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " p, %28, %29;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), 
"+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, 
uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " p, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + "{%26, %27, %28, %29}," + " %30," + " p, %32, %33;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = 
+// GMMA 64x104x32 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x104x32_F32E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[52];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %54, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n104k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51},"
+      " %52,"
+      " %53,"
+      " p, %55, %56;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x104x32 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x104x32_F32E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[52];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %57, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n104k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51},"
+      "{%52, %53, %54, %55},"
+      " %56,"
+      " p, %58, %59;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x112x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x112x32_F16E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[28];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %30, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n112k32.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27},"
+      " %28,"
+      " %29,"
+      " p, %31, %32;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x112x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x112x32_F16E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[28];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %33, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n112k32.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27},"
+      "{%28, %29, %30, %31},"
+      " %32,"
+      " p, %34, %35;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x112x32 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x112x32_F32E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[56];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %58, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55},"
+      " %56,"
+      " %57,"
+      " p, %59, %60;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x112x32 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x112x32_F32E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[56];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %61, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55},"
+      "{%56, %57, %58, %59},"
+      " %60,"
+      " p, %62, %63;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x120x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x120x32_F16E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[30];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %32, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n120k32.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29},"
+      " %30,"
+      " %31,"
+      " p, %33, %34;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x120x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x120x32_F16E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[30];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %35, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n120k32.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29},"
+      "{%30, %31, %32, %33},"
+      " %34,"
+      " p, %36, %37;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x120x32 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x120x32_F32E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[60];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %62, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n120k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59},"
+      " %60,"
+      " %61,"
+      " p, %63, %64;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x120x32 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x120x32_F32E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[60];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %65, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n120k32.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59},"
+      "{%60, %61, %62, %63},"
+      " %64,"
+      " p, %66, %67;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x136x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x136x32_F16E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[34];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %36, 0;\n"
+      "wgmma.mma_async.sync.aligned.m64n136k32.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33},"
+      " %34,"
+      " %35,"
+      " p, %37, %38;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// GMMA 64x136x32 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+struct MMA_64x136x32_F16E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[34];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05,
uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + "{%34, %35, %36, %37}," + " %38," + " p, %40, %41;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %70, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, 
%27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " p, %71, %72;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %73, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " p, %74, %75;\n" + "}\n" + : "+f"(d00), 
"+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, 
float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p, %75, %76;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) 
+ { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p, %78, %79;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " p, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), 
"+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + "{%38, %39, %40, %41}," + " %42," + " p, %44, %45;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + 
CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %78, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " p, %79, %80;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + 
uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %81, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " p, %82, %83;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & 
d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, 
%28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p, %83, %84;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), 
"+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p, %86, %87;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + 
"+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " p, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + "{%42, %43, %44, %45}," + " %46," + " p, %48, %49;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + 
float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %86, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " p, %87, %88;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float 
& d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %89, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " p, %90, %91;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, 
uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n176k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " 
%72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p, %91, %92;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, 
%21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p, %94, %95;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, 
%45}," + " %46," + " %47," + " p, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + "{%46, %47, %48, %49}," + " %50," + " p, %52, %53;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F16E4M3E4M3_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %94, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " p, %95, %96;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + 
"+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %97, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " p, %98, %99;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), 
"+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + " %50," + " %51," + " p, %53, %54;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + "{%50, %51, %52, %53}," + " %54," + " p, %56, %57;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, 
float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %102, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " p, %103, %104;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + 
"+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %105, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " p, %106, %107;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + 
"+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), 
"+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p, %107, %108;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), 
"+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " 
%40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p, %110, %111;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + " %54," + " %55," + " p, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + "{%54, %55, %56, %57}," + " %58," + " p, %60, %61;\n" + "}\n" + : 
"+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %110, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, 
%51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " p, %111, %112;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & 
d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %113, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " p, %114, %115;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters 
= uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, 
uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, 
float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p, %115, %116;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p, %118, %119;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), 
"+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " p, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), 
"+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " p, %64, %65;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : 
"r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %118, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " p, %119, %120;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), 
"+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float 
& d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %121, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " p, %122, %123;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + 
fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, 
uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, 
float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p, %123, %124;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), 
"+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, 
%115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p, %126, %127;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + " %62," + " %63," + " p, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, 
%20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " p, %68, %69;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, 
float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %126, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " p, %127, %128;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + 
float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %129, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " p, %130, %131;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), 
"+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5}," + " %6," + " %7," + " p, %9, %10;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5}," + "{%6, %7, %8, %9}," + " %10," + " p, %12, %13;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), 
+ "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using 
ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " p, %13, %14;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " p, %16, %17;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" 
+ "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), 
"+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " p, %17, %18;\n" + "}\n" + : "+r"(d00), 
"+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " p, %20, %21;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), 
"+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 
0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " p, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " p, %24, %25;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F16+=E4M3*E5M2 +template < 
+ GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + " %22," + " %23," + " p, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t 
const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " p, %28, %29;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), 
"+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, 
uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " p, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + "{%26, %27, %28, %29}," + " %30," + " p, %32, %33;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, 
float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : 
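+      // Editorial note: in these operand lists, "+r"/"+f" are 32-bit
+      // integer/float registers that are both read and written (the
+      // accumulator fragment), "l" binds a 64-bit register (the shared-memory
+      // matrix descriptors), and "n" a compile-time integer immediate (the
+      // ScaleIn template constants).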
"l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & 
d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, 
%53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + " %30," + " %31," + " p, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE 
static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + "{%30, %31, %32, %33}," + " %34," + " p, %36, %37;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, 
%31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + 
"+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + " %34," + " %35," + " p, %37, %38;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, 
uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + "{%34, %35, %36, %37}," + " %38," + " p, %40, %41;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %70, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, 
%27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " p, %71, %72;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %73, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " p, %74, %75;\n" + "}\n" + : "+f"(d00), 
"+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, 
float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p, %75, %76;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) 
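+  // Editorial note: unlike the ScaleIn template constants, scale_D is a
+  // runtime value; the setp/predicate sequence below feeds it to wgmma as the
+  // scale-d flag, so ScaleOut::Zero overwrites the accumulators (D = A*B)
+  // while the default ScaleOut::One accumulates (D = A*B + D).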
+ { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p, %78, %79;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " p, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), 
"+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + "{%38, %39, %40, %41}," + " %42," + " p, %44, %45;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + 
CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %78, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " p, %79, %80;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + 
uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %81, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " p, %82, %83;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & 
d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, 
%28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p, %83, %84;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), 
"+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p, %86, %87;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + 
"+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " p, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + "{%42, %43, %44, %45}," + " %46," + " p, %48, %49;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + 
float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %86, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " p, %87, %88;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float 
& d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %89, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " p, %90, %91;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, 
uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n176k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " 
%72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p, %91, %92;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, 
%21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p, %94, %95;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, 
%45}," + " %46," + " %47," + " p, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + "{%46, %47, %48, %49}," + " %50," + " p, %52, %53;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F16E4M3E5M2_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %94, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " p, %95, %96;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + 
"+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %97, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " p, %98, %99;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), 
"+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + " %50," + " %51," + " p, %53, %54;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + "{%50, %51, %52, %53}," + " %54," + " p, %56, %57;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, 
float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %102, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " p, %103, %104;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + 
"+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %105, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " p, %106, %107;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + 
"+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), 
"+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p, %107, %108;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), 
"+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " 
%40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p, %110, %111;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + " %54," + " %55," + " p, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + "{%54, %55, %56, %57}," + " %58," + " p, %60, %61;\n" + "}\n" + : 
"+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %110, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, 
%51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " p, %111, %112;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & 
d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %113, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " p, %114, %115;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters 
= uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, 
uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, 
float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p, %115, %116;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p, %118, %119;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), 
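// A note on the _SS_TN / _RS_TN suffixes, using this atom and the _SS_TN
// variant above as the pair: SS sources both A and B through SMEM
// descriptors (ARegisters = uint64_t[1]), while RS supplies the A fragment
// from registers instead (ARegisters = uint32_t[4]; 4 x 32 bits = 16 FP8
// values per thread per k32 step), with B still read from shared memory.
// TN is the only layout combination PTX allows for FP8 wgmma: both A and B
// K-major.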
"+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " p, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), 
"+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " p, %64, %65;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : 
"r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %118, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " p, %119, %120;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), 
"+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float 
& d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %121, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " p, %122, %123;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + 
fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, 
uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, 
float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p, %123, %124;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), 
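// These atoms lower to wgmma only when compiling for the sm_90a target
// (e.g. nvcc -gencode arch=compute_90a,code=sm_90a), which defines
// CUTE_ARCH_MMA_SM90A_ENABLED. On any other target the fma() body falls
// through to CUTE_INVALID_CONTROL_PATH, which traps with the given message
// rather than silently producing wrong results, so instantiating the
// struct on an unsupported architecture still compiles and links.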
"+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, 
%115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p, %126, %127;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + " %62," + " %63," + " p, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, 
%20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " p, %68, %69;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, 
float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %126, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " p, %127, %128;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + 
float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %129, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " p, %130, %131;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), 
"+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5}," + " %6," + " %7," + " p, %9, %10;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5}," + "{%6, %7, %8, %9}," + " %10," + " p, %12, %13;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), 
+ "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using 
ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " p, %13, %14;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " p, %16, %17;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" 
+ "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), 
"+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " p, %17, %18;\n" + "}\n" + : "+r"(d00), 
"+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " p, %20, %21;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), 
"+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 
0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " p, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " p, %24, %25;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F16+=E5M2*E4M3 +template < 
+ GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
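+// Implementation note: for an m64nNk32 GMMA atom executed by a 128-thread warpgroup,
+// each thread owns 64*N/128 = N/2 accumulator elements. The F16 variants therefore
+// declare CRegisters = uint32_t[N/4] (two packed halves per 32-bit register) and the
+// F32 variants declare CRegisters = float[N/2]; e.g. N=80 gives uint32_t[20] vs.
+// float[40]. SS atoms read both A and B through 64-bit shared-memory matrix
+// descriptors, while RS atoms source A from registers (uint32_t[4], i.e. 16 FP8
+// values per thread for the k32 slice) and only B through a descriptor. The
+// setp.ne.b32 converts the runtime scale_D argument into predicate p, selecting
+// between D = A*B (p false) and D = A*B + D (p true).
+//
+// Minimal usage sketch (illustrative only; the tiling call and tensor names below
+// are assumptions for exposition, not part of this header):
+//
+//   auto mma = cute::make_tiled_mma(MMA_64x80x32_F32E5M2E4M3_SS_TN<>{});
+//   cute::gemm(mma, tCrA, tCrB, tCrC);  // issued inside a warpgroup mainloop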
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + " %22," + " %23," + " p, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t 
const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " p, %28, %29;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), 
"+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, 
uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " p, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + "{%26, %27, %28, %29}," + " %30," + " p, %32, %33;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, 
float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : 
"l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & 
d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, 
%53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + " %30," + " %31," + " p, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE 
static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + "{%30, %31, %32, %33}," + " %34," + " p, %36, %37;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, 
%31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + 
"+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + " %34," + " %35," + " p, %37, %38;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, 
uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + "{%34, %35, %36, %37}," + " %38," + " p, %40, %41;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %70, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, 
%27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " p, %71, %72;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %73, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " p, %74, %75;\n" + "}\n" + : "+f"(d00), 
"+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, 
float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p, %75, %76;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) 
+ { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p, %78, %79;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " p, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), 
"+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + "{%38, %39, %40, %41}," + " %42," + " p, %44, %45;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + 
CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %78, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " p, %79, %80;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + 
uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %81, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " p, %82, %83;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & 
d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, 
%28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p, %83, %84;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), 
"+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p, %86, %87;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + 
"+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " p, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + "{%42, %43, %44, %45}," + " %46," + " p, %48, %49;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + 
float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %86, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " p, %87, %88;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float 
& d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %89, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " p, %90, %91;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, 
uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n176k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " 
%72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p, %91, %92;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, 
%21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p, %94, %95;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, 
%45}," + " %46," + " %47," + " p, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + "{%46, %47, %48, %49}," + " %50," + " p, %52, %53;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F16E5M2E4M3_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %94, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " p, %95, %96;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + 
"+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %97, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " p, %98, %99;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), 
"+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + " %50," + " %51," + " p, %53, %54;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + "{%50, %51, %52, %53}," + " %54," + " p, %56, %57;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, 
float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %102, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " p, %103, %104;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + 
"+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %105, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " p, %106, %107;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + 
"+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), 
"+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p, %107, %108;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), 
"+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " 
%40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p, %110, %111;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + " %54," + " %55," + " p, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + "{%54, %55, %56, %57}," + " %58," + " p, %60, %61;\n" + "}\n" + : 
"+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %110, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, 
%51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " p, %111, %112;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & 
d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %113, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " p, %114, %115;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters 
= uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, 
uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, 
float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p, %115, %116;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p, %118, %119;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), 
"+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " p, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), 
"+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " p, %64, %65;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : 
"r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %118, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " p, %119, %120;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), 
"+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float 
& d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %121, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " p, %122, %123;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + 
fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, 
uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, 
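+      // 120 floats per thread: the 64x240 f32 accumulator holds 64*240 = 15360
+      // values spread over the 128 threads of the issuing warpgroup, i.e.
+      // 15360/128 = 120 registers each; the f16 variants pack two halves per
+      // uint32_t and therefore need only 60.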
float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p, %123, %124;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), 
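+        // "+f" declares each accumulator as a read-write .f32 asm operand, so
+        // the compiler keeps the whole fragment live across the asm block and
+        // the wgmma can both consume and produce it (the D = A*B + D case).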
"+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, 
%115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p, %126, %127;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
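+    // CUTE_ARCH_MMA_SM90A_ENABLED is only defined when compiling for the
+    // architecture-specific sm_90a target (e.g. nvcc -arch=sm_90a); plain
+    // sm_90 does not expose wgmma, so the #else branch below turns any call
+    // into a CUTE_INVALID_CONTROL_PATH trap instead of miscompiling silently.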
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + " %62," + " %63," + " p, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, 
%20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " p, %68, %69;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, 
float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %126, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " p, %127, %128;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + 
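+      // RS variant: A comes from registers rather than a smem descriptor.
+      // a000..a003 carry the per-thread A fragment: the 64x32 e5m2 A tile is
+      // 2048 bytes, i.e. 16 bytes = four 32-bit registers per thread across
+      // the 128-thread warpgroup.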
float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %129, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " p, %130, %131;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), 
"+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5}," + " %6," + " %7," + " p, %9, %10;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5}," + "{%6, %7, %8, %9}," + " %10," + " p, %12, %13;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), 
+ "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x24x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x24x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n24k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x24x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using 
ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " p, %13, %14;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " p, %16, %17;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
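+    // synclog_emit_wgmma_smem_smem() records this wgmma issue, tagged with the
+    // source line and both smem descriptors, when the opt-in synclog
+    // instrumentation is compiled in; otherwise it is an empty inline function
+    // and the hot path is unchanged.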
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x40x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x40x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n40k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x40x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" 
+ "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " p, %15, %16;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " p, %18, %19;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " p, %27, %28;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), 
"+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x48x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x48x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sync.aligned.m64n48k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " p, %30, %31;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x48x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " p, %17, %18;\n" + "}\n" + : "+r"(d00), 
"+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " p, %20, %21;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), 
"+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x56x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x56x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n56k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x56x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 
0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " p, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " p, %24, %25;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x72x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x72x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n72k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x72x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F16+=E5M2*E5M2 +template < 
+ GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " p, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " p, %26, %27;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
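// ----------------------------------------------------------------------------
// Illustrative sketch, not part of the patch: every atom in this family
// follows one fixed recipe, so a single worked call shows how to read all of
// them. Real kernels drive these structs through cute::MMA_Atom /
// make_tiled_mma rather than calling fma() by hand, and the 64-bit GMMA
// shared-memory descriptors (desc_a / desc_b) come from CuTe's descriptor
// machinery, which is omitted here. The function name and raw-array
// accumulator are hypothetical, and the sketch assumes the enclosing
// namespace of this header so that the atoms and the warpgroup_* helpers
// resolve unqualified.
//
// Fragment sizes follow from the tile shape: a warpgroup has 128 threads, so
// a 64xNx32 atom carries 64*N/128 accumulator values per thread -- two f16
// packed per uint32_t for the F16 atoms (CRegisters = uint32_t[N/4]), one
// float each for the F32 atoms (CRegisters = float[N/2]). RS atoms also keep
// the per-thread A fragment in registers: 64*32 one-byte e5m2 values / 128
// threads = 16 bytes = uint32_t[4].

#include <cstdint>                      // already satisfied inside this header
#include <cute/arch/mma_sm90_gmma.hpp>  // warpgroup_* helpers, GMMA enums

// One K-step of the SS flavor: A and B both described in shared memory.
CUTE_DEVICE void
example_wgmma_ss_64x72x32(uint64_t desc_a, uint64_t desc_b, uint32_t (&acc)[18])
{
  using Atom = MMA_64x72x32_F16E5M2E5M2_SS_TN<>;  // ScaleIn::One for A and B

  warpgroup_arrive();  // wgmma.fence: order prior register writes before mma_async

  // ScaleOut::Zero overwrites the accumulator (D = A*B); pass ScaleOut::One on
  // later K-steps to accumulate (D += A*B). scale_D is tested with setp inside
  // the asm block, which is why it is a runtime "r" operand, while scaleA and
  // scaleB are "n" immediates (ScaleIn::Neg == -1 negates that input).
  Atom::fma(desc_a, desc_b,
            acc[ 0], acc[ 1], acc[ 2], acc[ 3], acc[ 4], acc[ 5],
            acc[ 6], acc[ 7], acc[ 8], acc[ 9], acc[10], acc[11],
            acc[12], acc[13], acc[14], acc[15], acc[16], acc[17],
            GMMA::ScaleOut::Zero);

  warpgroup_commit_batch();  // wgmma.commit_group: close the async batch
  warpgroup_wait<0>();       // wait until this batch's results are visible
}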
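// The RS flavor differs only in the A operand: the per-thread fragment
// arrives as four uint32_t values instead of a descriptor, e.g.
//   MMA_64x72x32_F16E5M2E5M2_RS_TN<>::fma(a0, a1, a2, a3, desc_b,
//                                         acc[0], /* ... */ acc[17], scale_D);
// Production code additionally wraps the accumulator registers in
// cute::warpgroup_fence_operand() so the compiler cannot reorder reads and
// writes of the in-flight fragment around the asynchronous wgmma.
// ----------------------------------------------------------------------------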
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x80x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x80x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n80k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x80x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + " %22," + " %23," + " p, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t 
const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " p, %28, %29;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), 
"+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x88x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x88x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sync.aligned.m64n88k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x88x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, 
uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " p, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + "{%26, %27, %28, %29}," + " %30," + " p, %32, %33;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x104x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x104x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, 
float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n104k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x104x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " p, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : 
"l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " p, %34, %35;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & 
d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x112x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x112x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n112k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, 
%53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x112x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + " %30," + " %31," + " p, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE 
static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + "{%30, %31, %32, %33}," + " %34," + " p, %36, %37;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, 
%31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x120x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x120x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n120k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + 
"+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x120x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + " %34," + " %35," + " p, %37, %38;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, 
uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + "{%34, %35, %36, %37}," + " %38," + " p, %40, %41;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %70, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, 
%27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " p, %71, %72;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x136x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x136x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %73, 0;\n" + "wgmma.mma_async.sync.aligned.m64n136k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " p, %74, %75;\n" + "}\n" + : "+f"(d00), 
"+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x136x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " p, %39, %40;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " p, %42, %43;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, 
float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %74, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " p, %75, %76;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x144x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x144x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) 
+ { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %77, 0;\n" + "wgmma.mma_async.sync.aligned.m64n144k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " p, %78, %79;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x144x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " p, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), 
"+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + "{%38, %39, %40, %41}," + " %42," + " p, %44, %45;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + 
CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %78, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " p, %79, %80;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x152x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x152x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + 
uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %81, 0;\n" + "wgmma.mma_async.sync.aligned.m64n152k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " p, %82, %83;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x152x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & 
d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " p, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, 
%28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " p, %46, %47;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %82, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " p, %83, %84;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), 
"+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x160x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x160x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %85, 0;\n" + "wgmma.mma_async.sync.aligned.m64n160k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " p, %86, %87;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + 
"+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x160x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " p, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + "{%42, %43, %44, %45}," + " %46," + " p, %48, %49;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + 
float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %86, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " p, %87, %88;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x168x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x168x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float 
& d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %89, 0;\n" + "wgmma.mma_async.sync.aligned.m64n168k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " p, %90, %91;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x168x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, 
uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " p, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + 
"wgmma.mma_async.sync.aligned.m64n176k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " p, %50, %51;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %90, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " 
%72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " p, %91, %92;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x176x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x176x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %93, 0;\n" + "wgmma.mma_async.sync.aligned.m64n176k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, 
%21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " p, %94, %95;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x176x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, 
%45}," + " %46," + " %47," + " p, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + "{%46, %47, %48, %49}," + " %50," + " p, %52, %53;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F16E5M2E5M2_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %94, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " p, %95, %96;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + 
"+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x184x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x184x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %97, 0;\n" + "wgmma.mma_async.sync.aligned.m64n184k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " p, %98, %99;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), 
"+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x184x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + " %50," + " %51," + " p, %53, %54;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + "{%50, %51, %52, %53}," + " %54," + " p, %56, %57;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, 
float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %102, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " p, %103, %104;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + 
"+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x200x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x200x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %105, 0;\n" + "wgmma.mma_async.sync.aligned.m64n200k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " p, %106, %107;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + 
"+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x200x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " p, %55, %56;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), 
"+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " p, %58, %59;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %106, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " p, %107, %108;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), 
"+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x208x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x208x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %109, 0;\n" + "wgmma.mma_async.sync.aligned.m64n208k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " 
%40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " p, %110, %111;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x208x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + " %54," + " %55," + " p, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + "{%54, %55, %56, %57}," + " %58," + " p, %60, %61;\n" + "}\n" + : 
"+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %110, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, 
%51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " p, %111, %112;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x216x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x216x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & 
d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %113, 0;\n" + "wgmma.mma_async.sync.aligned.m64n216k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " p, %114, %115;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x216x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters 
= uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " p, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, 
uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " p, %62, %63;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, 
float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %114, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " p, %115, %116;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x224x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x224x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %117, 0;\n" + "wgmma.mma_async.sync.aligned.m64n224k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " p, %118, %119;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), 
"+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x224x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " p, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), 
"+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " p, %64, %65;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : 
"r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %118, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " p, %119, %120;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), 
"+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x232x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x232x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float 
& d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %121, 0;\n" + "wgmma.mma_async.sync.aligned.m64n232k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " p, %122, %123;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x232x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + 
fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " p, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, 
uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " p, %66, %67;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, 
float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %122, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " p, %123, %124;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), 
"+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x240x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x240x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %125, 0;\n" + "wgmma.mma_async.sync.aligned.m64n240k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, 
%115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " p, %126, %127;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x240x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + " %62," + " %63," + " p, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, 
%20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " p, %68, %69;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, 
float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %126, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " p, %127, %128;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA 64x248x32 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +struct MMA_64x248x32_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + 
float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %129, 0;\n" + "wgmma.mma_async.sync.aligned.m64n248k32.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " p, %130, %131;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), 
"+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use MMA_64x248x32_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +} // namespace SM90::GMMA + +} // namespace cute diff --git a/include/cute/arch/mma_sm90_gmma_sparse.hpp b/include/cute/arch/mma_sm90_gmma_sparse.hpp new file mode 100644 index 0000000000..ecca91b93c --- /dev/null +++ b/include/cute/arch/mma_sm90_gmma_sparse.hpp @@ -0,0 +1,22743 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +#pragma once + +#include // CUTE_HOST_DEVICE +#include // GMMA::Major, etc. 
+ +namespace cute { + +namespace SM90::GMMA::SPARSE { + +//////////////////////////////////////////////////////////////////////////////////////////////////// +// GMMA PTX definitions: C = (scaleA * A) * (scaleB * B) + (scaleD * C) +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k32.f16.f16.f16 " + "{%0, %1}," + " %2," + " %3," + " %4, %5," + " p, %7, %8, %9, %10;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k32.f16.f16.f16 " + "{%0, %1}," + "{%2, %3, %4, %5}," + " %6," + " %7, %8," + " p, %10, %11, %12;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x16x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k32.f16.f16.f16 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10, %11, %12;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k32.f16.f16.f16 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13, %14;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t 
& d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14, %15, %16;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17, %18;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut 
const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = 
uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), 
"+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38, %39, %40;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, 
uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg 
.pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54, %55, %56;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, 
%56, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70, %71, %72;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + 
"+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73, %74;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), 
"+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k32.f32.f16.f16 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10, %11, %12;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, 
%11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k32.f32.f16.f16 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13, %14;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14, %15, %16;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17, %18;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), 
"r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22, %23, %24;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," 
+ "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25, %26;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38, %39, %40;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must 
have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41, %42;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, 
%13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54, %55, %56;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57, %58;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), 
"+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70, %71, %72;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), 
"+f"(d61), "+f"(d62), "+f"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73, %74;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), 
"r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p, %101, %102, %103, %104;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + 
"+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + 
" %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p, %104, %105, %106;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + 
float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p, %133, %134, %135, %136;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), 
"+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, 
%22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p, %136, %137, %138;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm 
volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k32.f32.bf16.bf16 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10, %11, %12;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k32.f32.bf16.bf16 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13, %14;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14, %15, %16;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + 
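+      // e and scale_D are runtime register operands ("r"); spsel, scaleA/scaleB, and tnspA/tnspB are compile-time immediates ("n")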
"r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17, %18;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22, %23, %24;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "l"(desc_a), + 
"l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25, %26;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38, %39, %40;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41, %42;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to 
use SM90::GMMA::SPARSE::GMMA_64x64x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54, %55, %56;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A 
must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57, %58;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & 
d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70, %71, %72;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + 
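+      // RS form: A is sourced from registers, so only tnspB remains as a transpose immediate
+      // (the static_assert above pins operand A to a K-major layout)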
float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73, %74;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + 
float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p, %101, %102, %103, %104;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + 
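+  // RS variant: operand A is a compressed (2:4 structured-sparse) BF16 fragment
+  // held in four 32-bit registers, operand B is read from shared memory through
+  // a 64-bit matrix descriptor, and d00..d95 are the f32 accumulators. The extra
+  // operand `e` is the 32-bit sparsity-metadata word.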
fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p, %104, %105, %106;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), 
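+        // `e` supplies the sparsity metadata; the spsel immediate selects which
+        // threads of the warpgroup contribute that metadata.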
"n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, 
%82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p, %133, %134, %135, %136;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, 
float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p, %136, %137, %138;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), 
"+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k16.f32.tf32.tf32 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k16.f32.tf32.tf32 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + 
"l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE 
GMMA 64x64x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k16.f32.tf32.tf32 " + "{%0, 
%1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
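+    // Reaching this wrapper without CUTE_ARCH_MMA_SM90A_ENABLED (i.e. in a build
+    // that does not target sm_90a) is an error rather than a silent no-op.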
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t 
const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & 
d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + 
float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p, %101, %102;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + 
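+  // One 32-bit register of 2:4 sparsity metadata accompanies the A fragment.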
using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p, %104, %105;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + 
"+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + 
" %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p, %133, %134;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, 
+ float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p, %136, %137;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), 
"+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.s8.s8 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x16x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + 
"setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); 
+ asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters 
= uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & 
d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, 
%17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, 
%62, %63}," + " %64," + " %65," + " %66, %67," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + 
" %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t 
& d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + 
uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), 
"+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, 
uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
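+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// For illustration only: a minimal sketch of invoking one of the SS atoms above directly.
+// In real kernels these structs are selected and driven through CuTe's tiled-MMA machinery,
+// and the call sits inside the usual warpgroup wgmma fence/commit/wait sequence. Here
+// `desc_a`/`desc_b` are assumed to be valid GMMA shared-memory matrix descriptors and
+// `meta` a packed 2:4 sparsity-metadata word for operand A; the helper name is hypothetical.
+CUTE_HOST_DEVICE void
+example_sparse_s8s8_fma(uint64_t const& desc_a, uint64_t const& desc_b,
+                        uint32_t const& meta, uint32_t (&acc)[4])
+{
+  // ScaleOut::One keeps the accumulator contents (D = A*B + D); ScaleOut::Zero would
+  // overwrite them (D = A*B).
+  GMMA_64x8x64_S32S8S8_SS_TN<>::fma(desc_a, desc_b,
+                                    acc[0], acc[1], acc[2], acc[3],
+                                    meta, GMMA::ScaleOut::One);
+}
+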
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.s8.s8 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm 
volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), 
"r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, 
%22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& 
a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & 
d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + 
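+        // Operand map for this m64n128k64 sparse MMA: {%0..%63} s32 accumulators,
+        // {%64..%67} A fragment in registers, %68 B shared-memory descriptor,
+        // %69 sparse metadata word, %70 sparsity selector, p scale-D predicate.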
"wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, 
%20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + 
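+      // 96 s32 accumulators per thread: the 64x192 tile has 12288 elements,
+      // i.e. 96 values for each of the warpgroup's 128 threads.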
uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = 
uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," 
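+        // %132 = B shared-memory descriptor, %133 = sparse metadata word,
+        // %134 = sparsity selector immediate; p carries the scale-D predicate (%135).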
+ " %133, %134," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & 
d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), 
"+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.s8.u8 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = 
uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, 
%13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.s8.u8 " + "{%0, %1, %2, %3, 
%4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t 
& d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + 
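+      // scale_D selects accumulation: ScaleOut::Zero gives D = A*B (the incoming
+      // accumulator is ignored), ScaleOut::One gives D = A*B + D.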
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, 
%42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), 
"+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + 
" %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & 
d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, 
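+      // d000..d127 (continued below) hold this thread's fragment of the
+      // 64x256 s32 accumulator tile: wgmma distributes the accumulator across
+      // the 128 threads of the warpgroup, N/2 = 128 registers per thread for s32
+      // (matching CRegisters = uint32_t[128] above).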
+ uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), 
"+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, 
uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=S8*U8 +template < + 
GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.s8.u8 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + 
"{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), 
"+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & 
d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, 
uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, 
%20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " 
%48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + 
"setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & 
d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + 
uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), 
"+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & 
d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), 
"+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.u8.s8 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, 
uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), 
"+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + 
" %34, %35," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, 
uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm 
volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p;\n" + "}\n" 
+ : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), 
"+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + 
" %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, 
uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & 
d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + 
"+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, 
uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters 
= uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.u8.s8 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + 
"l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), 
"+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, 
uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, 
%38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, 
%66, %67}," + " %68," + " %69, %70," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.u8.s8 " + "{%0, %1, %2, 
%3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & 
d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87), + "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91), + "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + 
uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), 
"+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + 
uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), 
"+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.u8.u8 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + 
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + 
"l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), 
"+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & 
d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), 
"+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), 
"+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, 
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95},"
+      " %96,"
+      " %97,"
+      " %98, %99,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
+        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
+        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x192x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x192x64_S32U8U8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[96];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91,
+      uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %100, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.u8.u8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95},"
+      " %96,"
+      " %97,"
+      " %98, %99,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
+        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
+        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x256x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x256x64_S32U8U8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[128];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123,
+      uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %132, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.u8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111, "
+      " %112, %113, %114, %115, %116, %117, %118, %119, "
+      " %120, %121, %122, %123, %124, %125, %126, %127},"
+      " %128,"
+      " %129,"
+      " %130, %131,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
"+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & 
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123,
+      uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %132, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.u8.u8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111, "
+      " %112, %113, %114, %115, %116, %117, %118, %119, "
+      " %120, %121, %122, %123, %124, %125, %126, %127},"
+      " %128,"
+      " %129,"
+      " %130, %131,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119),
+        "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123),
+        "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
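+// Editorial note, not part of the upstream change: the RS atoms that follow differ
+// from the SS atoms above only in how A is sourced. A arrives as four packed 32-bit
+// register fragments (ARegisters = uint32_t[4], "r"(a0..a3) constraints) instead of
+// a shared-memory descriptor, so the synclog hook becomes synclog_emit_wgmma_reg_smem
+// and only desc_b is recorded. B, the metadata word e, spsel, and scale_D are bound
+// exactly as in the SS variants.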
+// SPARSE GMMA 64x8x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x8x64_S32U8U8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[4];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
+      uint64_t const& desc_b,
+      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %11, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.u8.u8 "
+      "{%0, %1, %2, %3},"
+      "{%4, %5, %6, %7},"
+      " %8,"
+      " %9, %10,"
+      " p;\n"
+    "}\n"
+      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3)
+      : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x8x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x8x64_S32U8U8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[4];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
+      uint64_t const& desc_b,
+      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %11, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n8k64.s32.u8.u8.satfinite "
+      "{%0, %1, %2, %3},"
+      "{%4, %5, %6, %7},"
+      " %8,"
+      " %9, %10,"
+      " p;\n"
+    "}\n"
+      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3)
+      : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x16x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x16x64_S32U8U8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[8];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
+      uint64_t const& desc_b,
+      uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3,
+      uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %15, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.u8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7},"
+      "{%8, %9, %10, %11},"
+      " %12,"
+      " %13, %14,"
+      " p;\n"
+    "}\n"
+      : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3),
+        "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7)
+      : "r"(a0), "r"(a1), "r"(a2), "r"(a3),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
"r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN S32+=U8*U8 
+// SPARSE GMMA 64x32x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x32x64_S32U8U8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[16];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %23, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n32k64.s32.u8.u8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15},"
+      "{%16, %17, %18, %19},"
+      " %20,"
+      " %21, %22,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x64x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x64x64_S32U8U8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[32];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %39, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.u8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31},"
+      "{%32, %33, %34, %35},"
+      " %36,"
+      " %37, %38,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+ "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, 
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %55, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.u8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47},"
+      "{%48, %49, %50, %51},"
+      " %52,"
+      " %53, %54,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x96x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x96x64_S32U8U8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[48];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
"setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, 
%63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), 
"+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91, + uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, 
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95},"
+      "{%96, %97, %98, %99},"
+      " %100,"
+      " %101, %102,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
+        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
+        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x192x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x192x64_S32U8U8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[96];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      uint32_t & d88, uint32_t & d89, uint32_t & d90, uint32_t & d91,
+      uint32_t & d92, uint32_t & d93, uint32_t & d94, uint32_t & d95,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %103, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n192k64.s32.u8.u8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95},"
+      "{%96, %97, %98, %99},"
+      " %100,"
+      " %101, %102,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87),
+        "+r"(d88), "+r"(d89), "+r"(d90), "+r"(d91),
+        "+r"(d92), "+r"(d93), "+r"(d94), "+r"(d95)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x256x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x256x64_S32U8U8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[128];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), 
"+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + 
uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t & d120, uint32_t & d121, uint32_t & d122, uint32_t & d123, + uint32_t & d124, uint32_t & d125, uint32_t & d126, uint32_t & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119), + "+r"(d120), "+r"(d121), "+r"(d122), "+r"(d123), + "+r"(d124), "+r"(d125), "+r"(d126), "+r"(d127) + : "r"(a000), "r"(a001), 
"r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f16.e4m3.e4m3 " + "{%0, %1}," + " %2," + " %3," + " %4, %5," + " p, %7, %8;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f16.e4m3.e4m3 " + "{%0, %1}," + "{%2, %3, %4, %5}," + " %6," + " %7, %8," + " p, %10, %11;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE 
static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using 
ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const 
scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
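+    /* synclog_emit_* records this wgmma issue for CUTLASS's synclog debugging
+       facility and compiles away when synclog is not enabled. The asm block
+       then derives predicate p from scale_D: with ScaleOut::Zero the
+       instruction writes D = A*B, with ScaleOut::One it accumulates D += A*B. */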
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, 
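+      /* RS vs. SS: in the RS forms the A fragment lives in registers
+         (ARegisters = uint32_t[4], passed as a00..a03) while B is still read
+         from shared memory through desc_b; the SS forms take both operands as
+         shared-memory descriptors. */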
uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + 
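+  /* Template knobs, all baked into the instruction as immediates ("n"
+     constraints): scaleA/scaleB (GMMA::ScaleIn::One or ::Neg) optionally
+     negate the A/B inputs, and spsel is the sparsity selector that picks
+     which threads' metadata operand the sparse instruction consumes. */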
GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " 
%24," + " %25," + " %26, %27," + " p, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, 
float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, 
%55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x128x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, 
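+      /* k64 here is twice the dense FP8 GMMA depth: with 2:4 structured
+         sparsity only half of A's K elements are physically stored, and the
+         metadata word e records which positions they occupy. */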
float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & 
d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
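+    /* These intrinsics are normally reached through CuTe's MMA atom layer
+       rather than called by hand. A minimal sketch, assuming a sparse
+       MMA_Traits specialization for this op is available (illustrative only,
+       not a prescribed entry point):
+
+         using Op = cute::SM90::GMMA::SPARSE::GMMA_64x192x64_F16E4M3E4M3_SS_TN<>;
+         auto mma = cute::make_tiled_mma(cute::MMA_Atom<Op>{});
+         // cute::gemm(mma, ...) then lowers to calls of the fma() below.
+    */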
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + 
"+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, 
%89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p, %101, %102;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p, %104, %105;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + 
uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t 
& d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float 
& d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p, %133, %134;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), 
"+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, 
%29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p, %136, %137;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %6, 0;\n" + 
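+      // scale_D semantics: p != 0 accumulates (D = A*B + D); p == 0 overwrites (D = A*B).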
"wgmma.mma_async.sp.sync.aligned.m64n8k64.f16.e4m3.e5m2 " + "{%0, %1}," + " %2," + " %3," + " %4, %5," + " p, %7, %8;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f16.e4m3.e5m2 " + "{%0, %1}," + "{%2, %3, %4, %5}," + " %6," + " %7, %8," + " p, %10, %11;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> 
+struct GMMA_64x8x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + 
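+      // Operand numbering here: %0-%3 accumulators, %4-%7 A fragments, %8 desc_b,
+      // %9 metadata e, %10 spsel, %11 scale_D, %12/%13 scaleA/scaleB.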
"wgmma.mma_async.sp.sync.aligned.m64n16k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F32E4M3E5M2_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = 
void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero 
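+  // defaults: no negation of A or B (ScaleIn::One), metadata selector 0 (SparseSel::Zero)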
+> +struct GMMA_64x64x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 
TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, 
%5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + 
fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54;\n" + "}\n" + : "+f"(d00), 
"+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, 
uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, 
%63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), 
"+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + 
"r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x192x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p, %101, %102;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), 
"+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p, %104, %105;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + 
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70;\n" + "}\n" + 
: "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + 
"+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p, %133, %134;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& 
a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p, %136, %137;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + 
"+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f16.e5m2.e4m3 " + "{%0, %1}," + " %2," + " %3," + " %4, %5," + " p, %7, %8;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f16.e5m2.e4m3 " + "{%0, %1}," + "{%2, %3, %4, %5}," + " %6," + " %7, %8," + " p, %10, %11;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
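+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// Usage sketch (illustrative only; not part of the generated atoms above or below). Every atom in
+// this family wraps a single sparse wgmma instruction: the _SS variants take SMEM matrix
+// descriptors for both A and B, the _RS variants take the A fragment in registers; `e` carries the
+// 2:4 sparsity metadata word selected by `spsel`, and `scale_D = ScaleOut::Zero` computes D = A*B
+// instead of accumulating into D. Assuming SMEM descriptors `desc_a`/`desc_b` and metadata `e`
+// have already been constructed, a direct warpgroup-wide invocation of the 64x8x64 atom defined
+// above would look roughly like:
+//
+//   float d0 = 0, d1 = 0, d2 = 0, d3 = 0;                      // CRegisters = float[4]
+//   cute::warpgroup_arrive();                                  // wgmma.fence before issuing
+//   SM90::GMMA::SPARSE::GMMA_64x8x64_F32E5M2E4M3_SS_TN<>::fma( // defaults: scaleA/B = One, spsel = Zero
+//       desc_a, desc_b, d0, d1, d2, d3, e,                     // descriptors, accumulators, metadata
+//       GMMA::ScaleOut::One);                                  // accumulate into d0..d3
+//   cute::warpgroup_commit_batch();                            // commit the async wgmma batch
+//   cute::warpgroup_wait<0>();                                 // wait until results land in d0..d3
+//
+// In practice these atoms are consumed through CuTe's MMA_Atom / tiled-MMA machinery rather than
+// called directly; the sketch only shows what each fma() signature expects.
+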
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float 
& d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, 
%14;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + 
"r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), 
"+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n64k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), 
"+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x96x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t 
& d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), 
+ "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + 
uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, 
float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p, %101, %102;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & 
d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p, %104, %105;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F32E5M2E4M3_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
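+
+// The accumulator array sizes follow from the warpgroup shape: 128 threads
+// share one MxN tile, so each thread owns M*N/128 accumulator elements.
+// F32 atoms spend one "f" register per element, while F16 atoms pack two
+// elements per 32-bit "r" register. An illustrative compile-time check
+// (not load-bearing) against the 64x256 F16 atom above:
+static_assert(sizeof(GMMA_64x256x64_F16E5M2E4M3_SS_TN<>::CRegisters)
+                  == (64 * 256 / 128 / 2) * sizeof(uint32_t),
+              "each thread holds M*N/128 F16 accumulators, two per register");
+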
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
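+
+// Operand notes shared by the sparse atoms in this file: `e` carries the
+// 2:4 structured-sparsity metadata for the thread's slice of A, and `spsel`
+// selects which threads' metadata registers the instruction consumes
+// (SparseSel::Zero by default; see the PTX ISA for the exact mapping).
+// `scale_D` is lowered to the predicate `p` in the asm blocks; a minimal
+// sketch (hypothetical helper, for exposition only) of what that predicate
+// encodes:
+CUTE_HOST_DEVICE constexpr float
+example_scale_d_semantics(float ab, float d, GMMA::ScaleOut scale_D)
+{
+  // p = (scale_D != 0): accumulate into D when true, overwrite otherwise.
+  return (int32_t(scale_D) != 0) ? (ab + d) : ab;
+}
+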
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " 
%129," + " %130, %131," + " p, %133, %134;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & 
d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p, %136, %137;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), 
"+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %6, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f16.e5m2.e5m2 " + "{%0, %1}," + " %2," + " %3," + " %4, %5," + " p, %7, %8;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[2]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %9, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f16.e5m2.e5m2 " + "{%0, %1}," + "{%2, %3, %4, %5}," + " %6," + " %7, %8," + " p, %10, %11;\n" + "}\n" + : "+r"(d0), "+r"(d1) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x8x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3}," + " %4," + " %5," + " %6, %7," + " p, %9, %10;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x8x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x8x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n8k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x8x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %8, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3}," + 
" %4," + " %5," + " %6, %7," + " p, %9, %10;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[4]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %11, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3}," + "{%4, %5, %6, %7}," + " %8," + " %9, %10," + " p, %12, %13;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x16x64 TN F32+=E5M2*E5M2 +template < + 
GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x16x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + float & d0, float & d1, float & d2, float & d3, + float & d4, float & d5, float & d6, float & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n16k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17;\n" + "}\n" + : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3), + "+f"(d4), "+f"(d5), "+f"(d6), "+f"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x16x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %12, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + " %8," + " %9," + " %10, %11," + " p, %13, %14;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[8]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& 
a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, uint32_t & d6, uint32_t & d7, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %15, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7}," + "{%8, %9, %10, %11}," + " %12," + " %13, %14," + " p, %16, %17;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5), "+r"(d6), "+r"(d7) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x32x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x32x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, 
float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n32k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x32x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %20, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + " %16," + " %17," + " %18, %19," + " p, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[16]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + 
uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %23, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15}," + "{%16, %17, %18, %19}," + " %20," + " %21, %22," + " p, %24, %25;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
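+
+// Accumulator sizing tracks the tile shape: an M64xNxK64 f32 atom carries
+// N/2 floats per thread across the 128-thread warpgroup (float[4] at N=8 up
+// to float[128] at N=256), and the f16 atoms pack two halves per uint32_t,
+// halving the register count. A minimal device-side sketch against the
+// smallest f32 atom above (descriptor and metadata construction elided;
+// these are normally produced by CuTe's GMMA descriptor helpers):
+//
+//   uint64_t desc_a = /* smem descriptor for compressed A */;
+//   uint64_t desc_b = /* smem descriptor for B            */;
+//   uint32_t e      = /* packed 2:4 sparsity metadata     */;
+//   float d0 = 0.f, d1 = 0.f, d2 = 0.f, d3 = 0.f;
+//   GMMA_64x8x64_F32E5M2E5M2_SS_TN<>::fma(desc_a, desc_b,
+//                                         d0, d1, d2, d3,
+//                                         e, GMMA::ScaleOut::One);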
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x64x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x64x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n64k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x64x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + 
"{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = 
float[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x96x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x96x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, 
+ uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n96k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x96x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %36, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + " %32," + " %33," + " %34, %35," + " p, %37, %38;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), 
"+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[32]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %39, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31}," + "{%32, %33, %34, %35}," + " %36," + " %37, %38," + " p, %40, %41;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float 
& d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x128x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x128x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, 
float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n128k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x128x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, 
uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %52, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + " %48," + " %49," + " %50, %51," + " p, %53, %54;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[48]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %55, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, 
%30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47}," + "{%48, %49, %50, %51}," + " %52," + " %53, %54," + " p, %56, %57;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %100, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " 
%48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + " %96," + " %97," + " %98, %99," + " p, %101, %102;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x192x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x192x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[96]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & 
d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + float & d92, float & d93, float & d94, float & d95, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %103, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n192k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95}," + "{%96, %97, %98, %99}," + " %100," + " %101, %102," + " p, %104, %105;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91), + "+f"(d92), "+f"(d93), "+f"(d94), "+f"(d95) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x192x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & 
d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %68, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + " %64," + " %65," + " %66, %67," + " p, %69, %70;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[64]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, 
uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %71, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63}," + "{%64, %65, %66, %67}," + " %68," + " %69, %70," + " p, %72, %73;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, 
+ float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %132, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + " %128," + " %129," + " %130, %131," + " p, %133, %134;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), 
"+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x256x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x256x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[128]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + float & d124, float & d125, float & d126, float & d127, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %135, 
0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n256k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123, %124, %125, %126, %127}," + "{%128, %129, %130, %131}," + " %132," + " %133, %134," + " p, %136, %137;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123), + "+f"(d124), "+f"(d125), "+f"(d126), "+f"(d127) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x256x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +} // namespace SM90::GMMA::SPARSE + +} // namespace cute + +#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +#include "mma_sm90_gmma_sparse_ext.hpp" +#endif diff --git a/include/cute/arch/mma_sm90_gmma_sparse_ext.hpp b/include/cute/arch/mma_sm90_gmma_sparse_ext.hpp new file mode 100644 index 0000000000..c224e4034e --- /dev/null +++ b/include/cute/arch/mma_sm90_gmma_sparse_ext.hpp @@ -0,0 +1,60445 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+ * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +#pragma once + +#include <cute/config.hpp> // CUTE_HOST_DEVICE + +#include "cutlass/arch/synclog.hpp" + +// Config +#if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && defined(__CUDA_ARCH_FEAT_SM90_ALL)) +# define CUTE_ARCH_MMA_SM90A_ENABLED +#endif + +namespace cute { + +namespace SM90::GMMA::SPARSE { + +// SPARSE GMMA 64x24x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5}," + " %6," + " %7," + " %8, %9," + " p, %11, %12, %13, %14;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x32 F16+=F16*F16
+template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5}," + "{%6, %7, %8, %9}," + " %10," + " %11, %12," + " p, %14, %15, %16;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " %12, %13," + " p, %15, %16, %17, %18;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + 
GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " %15, %16," + " p, %18, %19, %20;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18, %19, %20;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA 
= GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " %16, %17," + " p, %19, %20, %21, %22;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
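+// A minimal usage sketch for the wrappers in this header (illustrative only: `desc_a`
+// and `desc_b` stand for 64-bit GMMA shared-memory matrix descriptors and `e` for the
+// packed 2:4 sparsity-metadata word; none of these names are defined here). Each
+// struct wraps exactly one `wgmma.mma_async.sp` instruction that accumulates into the
+// calling thread's register fragments:
+//
+//   uint32_t d[6] = {};  // six packed f16x2 accumulator fragments per thread (64x24 tile)
+//   GMMA_64x24x32_F16F16F16_SS<GMMA::Major::K, GMMA::Major::K>::fma(
+//       desc_a, desc_b,
+//       d[0], d[1], d[2], d[3], d[4], d[5],
+//       e, GMMA::ScaleOut::One);  // ScaleOut::One: D = A*B + D; ScaleOut::Zero: D = A*B
+//
+// In practice these atoms are driven through CuTe's TiledMMA machinery, and each batch
+// of wgmma calls must be bracketed by cute::warpgroup_arrive(), warpgroup_commit_batch(),
+// and warpgroup_wait<N>() to order the asynchronous accumulator updates.
+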
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " %19, %20," + " p, %22, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " %20, %21," + " p, %23, %24, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), 
"+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " %23, %24," + " p, %26, %27, %28;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const 
scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26, %27, %28;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x88x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + " %22," + " %23," + " %24, %25," + " p, %27, %28, %29, %30;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " %27, %28," + " p, %30, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + 
"+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " %28, %29," + " p, %31, %32, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t 
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %33, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n104k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25},"
+      "{%26, %27, %28, %29},"
+      " %30,"
+      " %31, %32,"
+      " p, %34, %35, %36;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
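+// Note on the operand numbering used throughout these strings: for an _SS atom
+// with R accumulator registers, %0..%(R-1) are the accumulators, %R and %(R+1)
+// the A/B descriptors, %(R+2) the metadata, %(R+3) the sparse selector, and
+// %(R+4) the scale_D value tested by "setp.ne.b32 p, ...". For m64n104 above,
+// R = 26, so the predicate reads %30 and the trailing immediates are %31..%34.
+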
"+r"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37, %38;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t 
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %34, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n120k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29},"
+      " %30,"
+      " %31,"
+      " %32, %33,"
+      " p, %35, %36, %37, %38;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x120x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x120x32_F16F16F16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[30];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %37, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n120k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29},"
+      "{%30, %31, %32, %33},"
+      " %34,"
+      " %35, %36,"
+      " p, %38, %39, %40;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
"+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + " %34," + " %35," + " %36, %37," + " p, %39, %40, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x32_F16F16F16_RS +{ + using DRegisters = void; + using 
+// SPARSE GMMA 64x136x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x136x32_F16F16F16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[34];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %41, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n136k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33},"
+      "{%34, %35, %36, %37},"
+      " %38,"
+      " %39, %40,"
+      " p, %42, %43, %44;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x144x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x144x32_F16F16F16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[36];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %40, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n144k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35},"
+      " %36,"
+      " %37,"
+      " %38, %39,"
+      " p, %41, %42, %43, %44;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x144x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x144x32_F16F16F16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[36];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %43, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n144k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35},"
+      "{%36, %37, %38, %39},"
+      " %40,"
+      " %41, %42,"
+      " p, %44, %45, %46;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
"+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " %40, %41," + " p, %43, %44, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = 
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[38];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %45, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n152k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37},"
+      "{%38, %39, %40, %41},"
+      " %42,"
+      " %43, %44,"
+      " p, %46, %47, %48;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
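+// The _RS variants differ from _SS only in sourcing operand A from registers:
+// ARegisters becomes uint32_t[4] and the descriptor slot for A is replaced by
+// a 4-register fragment in the instruction, hence the static_assert that A is
+// K-major and the missing tnspA immediate in the tail. A call-shape sketch,
+// with illustrative names only:
+//
+//   GMMA_64x152x32_F16F16F16_RS<GMMA::Major::K, GMMA::Major::K>::fma(
+//       a_frag[0], a_frag[1], a_frag[2], a_frag[3],  // A tile held in registers
+//       desc_b, d00, /* ... */, d37, meta, GMMA::ScaleOut::One);
+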
+// SPARSE GMMA 64x160x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x160x32_F16F16F16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[40];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %44, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n160k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39},"
+      " %40,"
+      " %41,"
+      " %42, %43,"
+      " p, %45, %46, %47, %48;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x160x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x160x32_F16F16F16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[40];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %47, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n160k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39},"
+      "{%40, %41, %42, %43},"
%44," + " %45, %46," + " p, %48, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " %44, %45," + " p, %47, %48, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x168x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x168x32_F16F16F16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[42];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %49, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n168k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41},"
+      "{%42, %43, %44, %45},"
+      " %46,"
+      " %47, %48,"
+      " p, %50, %51, %52;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x176x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x176x32_F16F16F16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[44];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %48, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n176k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43},"
+      " %44,"
+      " %45,"
+      " %46, %47,"
+      " p, %49, %50, %51, %52;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x176x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x176x32_F16F16F16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[44];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %51, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n176k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43},"
+      "{%44, %45, %46, %47},"
+      " %48,"
+      " %49, %50,"
+      " p, %52, %53, %54;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x184x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x184x32_F16F16F16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[46];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %50, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n184k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45},"
+      " %46,"
+      " %47,"
+      " %48, %49,"
+      " p, %51, %52, %53, %54;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x184x32 F16+=F16*F16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x184x32_F16F16F16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[46];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %53, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n184k32.f16.f16.f16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45},"
+      "{%46, %47, %48, %49},"
+      " %50,"
+      " %51, %52,"
+      " p, %54, %55, %56;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
"+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + " %50," + " %51," + " %52, %53," + " p, %55, %56, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), 
"n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + "{%50, %51, %52, %53}," + " %54," + " %55, %56," + " p, %58, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x32 
F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + static_assert(tnspA == 
GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & 
d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + " %54," + " %55," + " %56, %57," + " p, %59, %60, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t 
& d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + "{%54, %55, %56, %57}," + " %58," + " %59, %60," + " p, %62, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & 
d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t 
const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + 
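+      // Commentary (summarizing wgmma.mma_async semantics from the PTX ISA): p is the
+      // scale-d flag derived from scale_D; when it is false the previous accumulator
+      // contents are discarded (D = A*B) rather than accumulated into (D = D + A*B).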
"wgmma.mma_async.sp.sync.aligned.m64n232k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " %60, %61," + " p, %63, %64, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, 
%5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " %63, %64," + " p, %66, %67, %68;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, 
%38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66, %67, %68;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " 
%40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69, %70;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x32_F16F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, 
%61}," + " %62," + " %63," + " %64, %65," + " p, %67, %68, %69, %70;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x32_F16F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x32 F16+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x32_F16F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k32.f16.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, 
%52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " %67, %68," + " p, %70, %71, %72;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x32_F16F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18, %19, %20;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); 
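+  // Commentary (summarizing the operand convention used throughout this file): the _RS
+  // suffix means operand A is sourced from registers as four packed uint32_t while B is
+  // read from shared memory through a 64-bit matrix descriptor, hence the K-major
+  // requirement asserted above. The extra `e` operand carries the 2:4 structured-sparsity
+  // metadata, and `spsel` selects whose metadata wgmma.mma_async.sp consumes.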
+ + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21, %22;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26, %27, %28;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> 
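+// Template-parameter recap (per the GMMA enums used above): tnspA/tnspB encode the
+// majorness of A and B (Major::K vs Major::MN), scaleA/scaleB optionally negate an
+// input operand (ScaleIn::One or ScaleIn::Neg), and spsel is the sparsity-metadata
+// selector forwarded to the instruction.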
+struct GMMA_64x40x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29, %30;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30, %31, %32;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : 
"l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33, %34;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float 
& d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34, %35, %36;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37, %38;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x56x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42, %43, %44;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, 
float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45, %46;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46, %47, %48;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), 
"+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49, %50;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x32 F32+=F16*F16 
+template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50, %51, %52;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, 
float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53, %54;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k32.f32.f16.f16 " + "{%0, %1, 
%2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58, %59, %60;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61, %62;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), 
"+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62, %63, %64;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + 
"+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65, %66;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x112x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66, %67, %68;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69, %70;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = 
uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %72, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " %70, %71," + " p, %73, %74, %75, %76;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = 
float[68]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %75, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " %73, %74," + " p, %76, %77, %78;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = 
uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p, %77, %78, %79, %80;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using 
ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p, %80, %81, %82;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB 
= GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %80, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " %78, %79," + " p, %81, %82, %83, %84;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 
64x152x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %83, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " %81, %82," + " p, %84, %85, %86;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p, %85, %86, %87, %88;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + 
"+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p, %88, %89, %90;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), 
"+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %88, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " %86, %87," + " p, %89, %90, %91, %92;\n" + "}\n" + : "+f"(d00), 
"+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %91, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, 
%6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " %89, %90," + " p, %92, %93, %94;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float 
& d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p, %93, %94, %95, %96;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, 
float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p, %96, %97, %98;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = 
float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %96, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " %94, %95," + " p, %97, %98, %99, %100;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x184x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %99, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " %97, %98," + " p, %100, %101, %102;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), 
"+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %104, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, 
%42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " %102, %103," + " p, %105, %106, %107, %108;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float 
& d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %107, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " %105, %106," + " p, %108, %109, %110;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + 
GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p, %109, %110, %111, %112;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + 
"+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k32.f32.f16.f16 " + 
"{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p, %112, %113, %114;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, 
float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %112, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " %110, %111," + " p, %113, %114, %115, %116;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %115, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " %113, %114," + " p, %116, %117, %118;\n" + "}\n" + : 
"+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, 
float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p, %117, %118, %119, %120;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x32_F32F16F16_RS 
+{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p, %120, %121, %122;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), 
+ "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & 
d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %120, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " %118, %119," + " p, %121, %122, %123, %124;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major 
layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %123, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " %121, %122," + " p, %124, %125, %126;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), 
"+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + 
uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p, %125, %126, %127, %128;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A 
must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p, %128, %129, %130;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), 
"+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x32_F32F16F16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, 
float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %128, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " %126, %127," + " p, %129, %130, %131, %132;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x32_F32F16F16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x32 F32+=F16*F16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x248x32_F32F16F16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %131, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k32.f32.f16.f16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " %129, %130," + " p, %132, %133, %134;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), 
"+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x32_F32F16F16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18, %19, %20;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21, %22;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26, %27, %28;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + 
"l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29, %30;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30, %31, %32;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33, %34;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34, %35, %36;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, 
%22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37, %38;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42, %43, %44;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + 
GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45, %46;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut 
const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46, %47, %48;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49, %50;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), 
"+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50, %51, %52;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn 
scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53, %54;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, 
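+      // d00..d51: per-thread FP32 accumulator fragments; each of the 128
+      // threads in the warpgroup holds 64*104/128 = 52 elements of the
+      // 64x104 output tile.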
float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58, %59, %60;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + 
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61, %62;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " 
%40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62, %63, %64;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65, %66;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + 
"+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66, %67, %68;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), 
"+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69, %70;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), 
"r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %72, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " %70, %71," + " p, %73, %74, %75, %76;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), 
"n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %75, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " %73, %74," + " p, %76, %77, %78;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), 
"r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p, %77, %78, %79, %80;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), 
"+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p, %80, %81, %82;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), 
"+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %80, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " %78, %79," + " p, %81, %82, %83, %84;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), 
"+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %83, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " %81, %82," + " p, %84, %85, %86;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), 
"+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p, %85, %86, %87, 
%88;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " 
%16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p, %88, %89, %90;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { 
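+    // Operand numbering in the asm below: %0..%83 are the accumulator
+    // fragments ("+f", read-write), %84/%85 the A/B shared-memory
+    // descriptors, %86 the metadata register e, %87 the sparsity selector,
+    // %88 feeds the scale-d predicate p via setp, and %89..%92 are the
+    // scaleA/scaleB/tnspA/tnspB immediates. Every SS atom in this file
+    // follows the same pattern, offset by its accumulator count.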
+// SPARSE GMMA 64x168x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x168x32_F32BF16BF16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[84];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      float & d68, float & d69, float & d70, float & d71,
+      float & d72, float & d73, float & d74, float & d75,
+      float & d76, float & d77, float & d78, float & d79,
+      float & d80, float & d81, float & d82, float & d83,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %88, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n168k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83},"
+      " %84,"
+      " %85,"
+      " %86, %87,"
+      " p, %89, %90, %91, %92;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
+        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
+        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
+        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
+        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x168x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x168x32_F32BF16BF16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[84];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      float & d68, float & d69, float & d70, float & d71,
+      float & d72, float & d73, float & d74, float & d75,
+      float & d76, float & d77, float & d78, float & d79,
+      float & d80, float & d81, float & d82, float & d83,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %91, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n168k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83},"
+      "{%84, %85, %86, %87},"
+      " %88,"
+      " %89, %90,"
+      " p, %92, %93, %94;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
+        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
+        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
+        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
+        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x176x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x176x32_F32BF16BF16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[88];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      float & d68, float & d69, float & d70, float & d71,
+      float & d72, float & d73, float & d74, float & d75,
+      float & d76, float & d77, float & d78, float & d79,
+      float & d80, float & d81, float & d82, float & d83,
+      float & d84, float & d85, float & d86, float & d87,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %92, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n176k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87},"
+      " %88,"
+      " %89,"
+      " %90, %91,"
+      " p, %93, %94, %95, %96;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
+        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
+        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
+        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
+        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
+        "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x176x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x176x32_F32BF16BF16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[88];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      float & d68, float & d69, float & d70, float & d71,
+      float & d72, float & d73, float & d74, float & d75,
+      float & d76, float & d77, float & d78, float & d79,
+      float & d80, float & d81, float & d82, float & d83,
+      float & d84, float & d85, float & d86, float & d87,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %95, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n176k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87},"
+      "{%88, %89, %90, %91},"
+      " %92,"
+      " %93, %94,"
+      " p, %96, %97, %98;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
+        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
+        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
+        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
+        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
+        "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x184x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x184x32_F32BF16BF16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[92];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      float & d68, float & d69, float & d70, float & d71,
+      float & d72, float & d73, float & d74, float & d75,
+      float & d76, float & d77, float & d78, float & d79,
+      float & d80, float & d81, float & d82, float & d83,
+      float & d84, float & d85, float & d86, float & d87,
+      float & d88, float & d89, float & d90, float & d91,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %96, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n184k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91},"
+      " %92,"
+      " %93,"
+      " %94, %95,"
+      " p, %97, %98, %99, %100;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
+        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
+        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
+        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
+        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
+        "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87),
+        "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x184x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x184x32_F32BF16BF16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[92];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      float & d68, float & d69, float & d70, float & d71,
+      float & d72, float & d73, float & d74, float & d75,
+      float & d76, float & d77, float & d78, float & d79,
+      float & d80, float & d81, float & d82, float & d83,
+      float & d84, float & d85, float & d86, float & d87,
+      float & d88, float & d89, float & d90, float & d91,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %99, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n184k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91},"
+      "{%92, %93, %94, %95},"
+      " %96,"
+      " %97, %98,"
+      " p, %100, %101, %102;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
+        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
+        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
+        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
+        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
+        "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87),
+        "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
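+// For these .sp atoms, operand A is 2:4 structured sparse along K: only the
+// two nonzero values of every four-element group are stored, which is why a
+// k32 atom moves half the A data of its dense counterpart, and the metadata
+// word `e` records the positions of the stored values. GMMA::SparseSel
+// selects which threads' metadata the instruction consumes; the exact
+// encoding is defined by the PTX ISA for wgmma.mma_async.sp.
+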
+// SPARSE GMMA 64x200x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x200x32_F32BF16BF16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[100];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %104, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n200k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99},"
+      " %100,"
+      " %101,"
+      " %102, %103,"
+      " p, %105, %106, %107, %108;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x200x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x200x32_F32BF16BF16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[100];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %107, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n200k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99},"
+      "{%100, %101, %102, %103},"
+      " %104,"
+      " %105, %106,"
+      " p, %108, %109, %110;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x208x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x208x32_F32BF16BF16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[104];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      float & d100, float & d101, float & d102, float & d103,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %108, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n208k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103},"
+      " %104,"
+      " %105,"
+      " %106, %107,"
+      " p, %109, %110, %111, %112;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
+        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x208x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x208x32_F32BF16BF16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[104];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      float & d100, float & d101, float & d102, float & d103,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %111, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n208k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103},"
+      "{%104, %105, %106, %107},"
+      " %108,"
+      " %109, %110,"
+      " p, %112, %113, %114;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
+        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x216x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x216x32_F32BF16BF16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[108];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      float & d100, float & d101, float & d102, float & d103,
+      float & d104, float & d105, float & d106, float & d107,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %112, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n216k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107},"
+      " %108,"
+      " %109,"
+      " %110, %111,"
+      " p, %113, %114, %115, %116;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
+        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
+        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x216x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x216x32_F32BF16BF16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[108];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      float & d100, float & d101, float & d102, float & d103,
+      float & d104, float & d105, float & d106, float & d107,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %115, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n216k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107},"
+      "{%108, %109, %110, %111},"
+      " %112,"
+      " %113, %114,"
+      " p, %116, %117, %118;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
+        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
+        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
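+// A minimal usage sketch (illustrative only; `frag_D`, `desc_a`, `desc_b`,
+// and `e` are assumed to have been built elsewhere): a wgmma.mma_async issue
+// must be bracketed by the warpgroup fence/commit/wait sequence, e.g.
+//
+//   cute::warpgroup_fence_operand(frag_D);
+//   cute::warpgroup_arrive();           // wgmma.fence.sync.aligned
+//   GMMA_64x216x32_F32BF16BF16_SS<GMMA::Major::K, GMMA::Major::K>::fma(
+//       desc_a, desc_b, /* d000 ... d107 */, e);
+//   cute::warpgroup_commit_batch();     // wgmma.commit_group.sync.aligned
+//   cute::warpgroup_wait<0>();          // wgmma.wait_group.sync.aligned 0
+//
+// In practice these structs are not called directly; they are wrapped in an
+// MMA_Atom and driven through cute::gemm on a TiledMMA.
+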
+// SPARSE GMMA 64x224x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x224x32_F32BF16BF16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[112];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      float & d100, float & d101, float & d102, float & d103,
+      float & d104, float & d105, float & d106, float & d107,
+      float & d108, float & d109, float & d110, float & d111,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %116, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n224k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111},"
+      " %112,"
+      " %113,"
+      " %114, %115,"
+      " p, %117, %118, %119, %120;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
+        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
+        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107),
+        "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x224x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x224x32_F32BF16BF16_RS
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[112];
+
+  static_assert(tnspA == GMMA::Major::K,
+      "Register source operand A must have K major layout.");
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      float & d100, float & d101, float & d102, float & d103,
+      float & d104, float & d105, float & d106, float & d107,
+      float & d108, float & d109, float & d110, float & d111,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %119, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n224k32.f32.bf16.bf16 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111},"
+      "{%112, %113, %114, %115},"
+      " %116,"
+      " %117, %118,"
+      " p, %120, %121, %122;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
+        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
+        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107),
+        "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x232x32 F32+=BF16*BF16
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x232x32_F32BF16BF16_SS
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[116];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float &
d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %120, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " %118, %119," + " p, %121, %122, %123, %124;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + 
"l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %123, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, 
%78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " %121, %122," + " p, %124, %125, %126;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & 
d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p, %125, %126, %127, %128;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), 
"+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, 
%62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p, %128, %129, %130;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x32_F32BF16BF16_SS +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, 
float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %128, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " %126, %127," + " p, %129, %130, %131, %132;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), 
"+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspA)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x32_F32BF16BF16_SS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x32 F32+=BF16*BF16 +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x32_F32BF16BF16_RS +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + static_assert(tnspA == GMMA::Major::K, + "Register source operand A must have K major layout."); + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %131, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n248k32.f32.bf16.bf16 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " %129, %130," + " p, %132, %133, %134;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)), "n"(int32_t(tnspB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x32_F32BF16BF16_RS without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& 
e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & 
d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
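+////////////////////////////////////////////////////////////////////////////////////////////////////
+//
+// Note on the wrapper pattern shared by every sparse GMMA atom in this file:
+//
+//  * "_SS" variants source both A and B from shared memory through 64-bit
+//    matrix descriptors (desc_a, desc_b); "_RS" variants source A from four
+//    32-bit register fragments instead, which is why they static_assert that
+//    operand A is K-major and (for the templated-major atoms) drop the tnspA
+//    immediate.
+//  * "e" is the 32-bit 2:4 structured-sparsity metadata word, and "spsel" is
+//    forwarded as the sparsity-selector immediate of wgmma.mma_async.sp.
+//  * The "setp.ne.b32 p, %<scale_D>, 0" preamble implements GMMA::ScaleOut:
+//    with ScaleOut::Zero the predicate is false and the accumulator is
+//    overwritten (D = A*B); with ScaleOut::One it is accumulated into
+//    (D = A*B + D). scaleA/scaleB (GMMA::ScaleIn) can likewise negate an
+//    input operand.
+//  * For the f32-accumulator atoms here, CRegisters holds N/2 floats: the
+//    64xN accumulator tile distributed across the 128 threads of one
+//    warpgroup.
+//  * The cutlass::arch::synclog_emit_wgmma_* call records the operation for
+//    the synclog debugging utility.
+//
+// A minimal calling sketch, assuming CUTE_ARCH_MMA_SM90A_ENABLED is defined
+// and that the descriptors and metadata word were produced elsewhere
+// (desc_a, desc_b, and meta are illustrative names, not symbols defined in
+// this file):
+//
+//   using MMA = SM90::GMMA::SPARSE::GMMA_64x24x16_F32TF32TF32_SS_TN<>;
+//   float d[12] = {};                  // N/2 = 24/2 accumulator registers
+//   MMA::fma(desc_a, desc_b,
+//            d[0], d[1], d[2], d[3], d[4],  d[5],
+//            d[6], d[7], d[8], d[9], d[10], d[11],
+//            meta,                     // uint32_t 2:4 sparsity metadata
+//            GMMA::ScaleOut::One);     // accumulate into d
+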
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, 
%14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x16_F32TF32TF32_RS_TN +{ + using DRegisters = 
void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n80k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & 
d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + 
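// NOTE: operand order below: {D accumulators}, A smem descriptor, B smem descriptor, + // sparse metadata e, sparse selector immediate, predicate p (scale-D), then the + // immediate A/B input scales. +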
"wgmma.mma_async.sp.sync.aligned.m64n104k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + 
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x120x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & 
d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, 
float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %72, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " %70, %71," + " p, %73, %74;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, 
float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %75, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " %73, %74," + " p, %76, %77;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, 
float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p, %77, %78;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float 
& d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p, %80, %81;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & 
d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %80, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " %78, %79," + " p, %81, %82;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & 
d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %83, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " %81, %82," + " p, %84, %85;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & 
d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p, %85, %86;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & 
d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p, %88, %89;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & 
d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %88, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " %86, %87," + " p, %89, %90;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x16_F32TF32TF32_RS_TN +{ + using DRegisters = 
void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %91, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " %89, %90," + " p, %92, %93;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x16_F32TF32TF32_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p, %93, %94;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + 
"+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p, %96, %97;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), 
"+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %96, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, 
%69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " %94, %95," + " p, %97, %98;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %99, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " %97, %98," + " p, %100, %101;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & 
d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %104, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " %102, %103," + " p, %105, %106;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x200x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %107, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " %105, %106," + " p, %108, %109;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), 
"+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, 
%55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p, %109, %110;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & 
d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p, %112, %113;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x16_F32TF32TF32_SS_TN +{ + using 
DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %112, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " %110, %111," + " p, %113, %114;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), 
"+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %115, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, 
%15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " %113, %114," + " p, %116, %117;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + 
float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p, %117, %118;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p, %120, %121;\n" + "}\n" + : "+f"(d000), "+f"(d001), 
"+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & 
d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %120, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " %118, %119," + " p, %121, %122;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> 
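+
+// Shared conventions for the SPARSE::GMMA atoms in this file, summarized
+// here ahead of the register-sourced (RS) variant below:
+//   - _SS_TN atoms read both A and B from shared memory through 64-bit
+//     matrix descriptors (ARegisters = uint64_t[1]);
+//   - _RS_TN atoms read the A fragment from four 32-bit registers
+//     (ARegisters = uint32_t[4]) and only B through a descriptor;
+//   - `e` (ERegisters = uint32_t[1]) carries the 2:4 structured-sparsity
+//     metadata for A, and the `spsel` template parameter selects which part
+//     of that metadata the instruction consumes;
+//   - `scale_D` is lowered to the predicate `p`: ScaleOut::Zero overwrites
+//     the accumulator (D = A*B) instead of accumulating (D = A*B + D);
+//   - `scaleA`/`scaleB` are immediate +/-1 input scales (GMMA::ScaleIn).
+//
+// Illustrative call shape for the RS atom below. This is a sketch only: the
+// values are hypothetical, and a real kernel reaches these atoms through
+// cute::MMA_Atom together with the surrounding wgmma fence/commit_group/
+// wait_group sequence rather than by calling fma() directly:
+//
+//   uint32_t a[4];        // A fragment in registers (tf32 bit patterns)
+//   uint64_t desc_b;      // shared-memory descriptor for B, built elsewhere
+//   uint32_t e;           // 2:4 sparsity metadata word for A
+//   float    d[116] = {}; // accumulator fragment (CRegisters = float[116])
+//   SM90::GMMA::SPARSE::GMMA_64x232x16_F32TF32TF32_RS_TN<>::fma(
+//       a[0], a[1], a[2], a[3], desc_b,
+//       d[0], d[1], /* ..., */ d[115],
+//       e, GMMA::ScaleOut::One);
+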
+struct GMMA_64x232x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %123, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " %121, %122," + " p, %124, %125;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), 
"+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & 
d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p, %125, %126;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + 
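+  // Register layout (inferred from the declarations above): RS atoms read A
+  // from registers; ARegisters holds four 32-bit fragments of the
+  // 2:4-compressed A tile, while B is named by a 64-bit shared-memory matrix
+  // descriptor (BRegisters). ERegisters carries the sparsity-metadata word
+  // consumed alongside A. CRegisters is the 64x240 f32 accumulator tile
+  // distributed over the 128 threads of a warpgroup: 64*240/128 = 120 floats
+  // per thread, matching float[120] above.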
CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p, %128, %129;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + 
"+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x16_F32TF32TF32_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + 
float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %128, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " %126, %127," + " p, %129, %130;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x16_F32TF32TF32_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x16 TN F32+=TF32*TF32 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x16_F32TF32TF32_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = 
uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %131, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k16.f32.tf32.tf32 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " %129, %130," + " p, %132, %133;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), 
"+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x16_F32TF32TF32_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & 
d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + 
uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + 
"l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + 
uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t 
& d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm 
volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + 
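+    // The predicate p below is set from the runtime scale_D operand:
+    // GMMA::ScaleOut::Zero makes the instruction compute D = A*B, while
+    // ScaleOut::One computes D = A*B + D. As we read the PTX wgmma semantics,
+    // the .satfinite suffix on this s32.s8.s8 shape clamps the 32-bit
+    // accumulators at the s32 limits on overflow instead of wrapping.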
asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D 
= GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t 
& d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, 
uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + 
uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> 
+struct GMMA_64x208x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), 
"+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + 
uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), 
"+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, 
%51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32S8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & 
d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), 
"+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32S8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, 
%36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), 
"+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + 
"+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, 
uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), 
"+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32S8S8_RS_TN +{ + using DRegisters = void; 
+ using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
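+// Usage sketch (illustrative only, not part of this change): these structs are
+// raw instruction wrappers, normally reached through a cute::TiledMMA built
+// with make_tiled_mma rather than called by hand. Assuming an SM90A build, a
+// direct device-side call of the 64x144x64 RS atom above would look like:
+//
+//   uint32_t a[4];     // A fragment: 2:4-compressed s8 values in registers
+//   uint32_t e;        // sparsity metadata word for the A fragment
+//   uint64_t desc_b;   // shared-memory matrix descriptor for B
+//   uint32_t d[72];    // s32 accumulators: N/2 = 144/2 = 72 registers
+//   GMMA_64x144x64_S32S8S8_RS_TN<>::fma(
+//       a[0], a[1], a[2], a[3], desc_b,
+//       d[0], /* ..., */ d[71],           // all 72 accumulators, spelled out
+//       e, GMMA::ScaleOut::One);          // ScaleOut::Zero gives D = A*B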
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + 
"+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), 
"+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, 
%83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & 
d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
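+// Two reading aids for the 64x208x64 atom above (and its _SATURATE twin
+// below). First, the asm operand numbering: the 104 accumulators occupy
+// %0..%103, the four A registers %104..%107, desc_b is %108, the metadata word
+// e is %109, spsel is %110, and scale_D is %111 -- hence "setp.ne.b32 p, %111,
+// 0" to turn scale_D into the accumulate-enable predicate. Second, the
+// _SATURATE variant differs only by the .satfinite qualifier, which clamps the
+// s32 result to the destination range on overflow instead of wrapping.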
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), 
"+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, 
uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32S8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + 
CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), 
"+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32S8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + 
uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.s8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=S8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32S8S8_RS_TN_SATURATE +{ + 
using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.s8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), 
"+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32S8U8_SS_TN_SATURATE +{ + using 
DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = 
uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), 
"+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & 
d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, 
uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, 
uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & 
d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & 
d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + 
uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, 
uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& 
desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), 
"+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, 
uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, 
uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), 
"+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32S8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & 
d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), 
"+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32S8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, 
desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
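+    // RS form: the A fragment lives in registers (a00..a03), B is read through
+    // a shared-memory matrix descriptor (desc_b), and `e` supplies the sparsity
+    // metadata. `spsel` is baked in as an immediate, and `scale_D` is lowered to
+    // predicate `p` below, selecting D = A*B (p false) or D = A*B + D (p true).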
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg 
.pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, 
uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, 
%19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), 
"+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_RS_TN_SATURATE 
without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + 
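+    // Guard branch: reached only when this atom is instantiated in a build
+    // that does not target sm_90a (CUTE_ARCH_MMA_SM90A_ENABLED undefined).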
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), 
"r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), 
"+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + 
"+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, 
%38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, 
uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, 
uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.s8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32S8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.s8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p;\n" + "}\n" + : 
"+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=S8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32S8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + 
uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %119, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.s8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111},"
+      "{%112, %113, %114, %115},"
+      " %116,"
+      " %117, %118,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x224x64 TN S32+=S8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x224x64_S32S8U8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[112];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %119, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.s8.u8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111},"
+      "{%112, %113, %114, %115},"
+      " %116,"
+      " %117, %118,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x240x64 TN S32+=S8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x240x64_S32S8U8_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[120];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %127, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.s8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111, "
+      " %112, %113, %114, %115, %116, %117, %118, %119},"
+      "{%120, %121, %122, %123},"
+      " %124,"
+      " %125, %126,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x240x64 TN S32+=S8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x240x64_S32S8U8_RS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[120];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107,
+      uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111,
+      uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115,
+      uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %127, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.s8.u8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111, "
+      " %112, %113, %114, %115, %116, %117, %118, %119},"
+      "{%120, %121, %122, %123},"
+      " %124,"
+      " %125, %126,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x24x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x24x64_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[12];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %16, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11},"
+      " %12,"
+      " %13,"
+      " %14, %15,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x24x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x24x64_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[12];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %16, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11},"
+      " %12,"
+      " %13,"
+      " %14, %15,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x48x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x48x64_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[24];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %28, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23},"
+      " %24,"
+      " %25,"
+      " %26, %27,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
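+// [Editorial sketch -- not part of the upstream change] These structs are
+// raw PTX bindings; CuTe code normally reaches them through an MMA atom and
+// tiled MMA rather than calling fma() directly. For orientation only, a
+// direct call to the smallest SS atom above -- assuming valid shared-memory
+// matrix descriptors da/db and packed 2:4 sparsity metadata e -- would look
+// roughly like:
+//
+//   uint32_t c[12] = {};  // N/2 = 12 s32 accumulators for m64n24k64
+//   SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_SS_TN<>::fma(
+//       da, db,
+//       c[0], c[1], c[2], c[3], c[4],  c[5],
+//       c[6], c[7], c[8], c[9], c[10], c[11],
+//       e, GMMA::ScaleOut::Zero);  // Zero: overwrite C rather than accumulate
+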
+// SPARSE GMMA 64x48x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x48x64_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[24];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %28, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23},"
+      " %24,"
+      " %25,"
+      " %26, %27,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x80x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x80x64_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[40];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %44, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39},"
+      " %40,"
+      " %41,"
+      " %42, %43,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x80x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x80x64_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[40];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %44, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39},"
+      " %40,"
+      " %41,"
+      " %42, %43,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x112x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x112x64_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[56];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %60, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55},"
+      " %56,"
+      " %57,"
+      " %58, %59,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x112x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x112x64_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[56];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %60, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55},"
+      " %56,"
+      " %57,"
+      " %58, %59,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x144x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x144x64_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[72];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %76, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71},"
+      " %72,"
+      " %73,"
+      " %74, %75,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x144x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x144x64_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[72];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %76, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71},"
+      " %72,"
+      " %73,"
+      " %74, %75,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
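+// [Editorial note] The _SATURATE variants differ from their base atoms only
+// in the ".satfinite" qualifier, which clamps the s32 accumulation at the
+// integer limits instead of wrapping on overflow. In every atom, scale_D is
+// lowered through "setp.ne.b32" into the predicate that wgmma takes as its
+// scale-d input (ScaleOut::One accumulates into C; ScaleOut::Zero makes the
+// MMA overwrite it), and the SparseSel template argument is emitted as the
+// instruction's immediate sparsity selector ("n"(int32_t(spsel))), choosing
+// how the 2:4 metadata in `e` is applied.
+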
+// SPARSE GMMA 64x160x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x160x64_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[80];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %84, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79},"
+      " %80,"
+      " %81,"
+      " %82, %83,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x160x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x160x64_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[80];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %84, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79},"
+      " %80,"
+      " %81,"
+      " %82, %83,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x176x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x176x64_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[88];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %92, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87},"
+      " %88,"
+      " %89,"
+      " %90, %91,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x176x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x176x64_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[88];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55,
+      uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59,
+      uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63,
+      uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67,
+      uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71,
+      uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75,
+      uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79,
+      uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83,
+      uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %92, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.u8.s8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87},"
+      " %88,"
+      " %89,"
+      " %90, %91,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55),
+        "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59),
+        "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63),
+        "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67),
+        "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71),
+        "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75),
+        "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79),
+        "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83),
+        "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x208x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x208x64_S32U8S8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[104];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087,
+      uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091,
+      uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095,
+      uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099,
+      uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %108, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.u8.s8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103},"
+      " %104,"
+      " %105,"
+      " %106, %107,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x208x64 TN S32+=U8*S8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x208x64_S32U8S8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[104];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003,
+      uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007,
+      uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011,
+      uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015,
+      uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019,
+      uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023,
+      uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027,
+      uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031,
+      uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035,
+      uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039,
+      uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043,
+      uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047,
+      uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051,
+      uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055,
+      uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059,
+      uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063,
+      uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067,
+      uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071,
+      uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075,
+      uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079,
+      uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083,
+      uint32_t & d084, uint32_t & d085, uint32_t & d086,
uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + 
uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), 
"+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + 
"{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32U8S8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + 
uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), 
"+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8S8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32U8S8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8S8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
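+    // RS variant: the A fragment arrives in registers (a00..a03) and only B is
+    // named by a shared-memory descriptor, hence the reg_smem synclog event.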
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + 
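+      // d00..d39: per-thread accumulator fragment. The 64x80 s32 tile is
+      // spread across the 128-thread warpgroup, i.e. 40 registers per thread.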
uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.u8.s8.satfinite " + 
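+    // .satfinite: on overflow the s32 result saturates to the representable
+    // range rather than wrapping; otherwise identical to the plain atom.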
"{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), 
"+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), 
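+      // a00..a03: the register-resident A fragment (16 bytes per thread).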
"r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + 
"+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + 
"+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), 
"+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, 
%19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & 
d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, 
uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), 
"r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, 
%102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & 
d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + 
"+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32U8S8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, 
uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.u8.s8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8S8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
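+// Note: the CRegisters extent of each sparse atom above follows directly from
+// its tile shape. A 64xNxK sparse wgmma spreads one 64xN tile of 32-bit
+// accumulators over the 128 threads of a warpgroup, i.e. 64*N/128 registers
+// per thread. A minimal sketch of that invariant, using a hypothetical helper
+// (illustration only, compiled out; not part of this change):
+#if 0
+constexpr int sparse_gmma_c_regs(int n) { return 64 * n / 128; }
+static_assert(sparse_gmma_c_regs(240) == 120, "matches CRegisters = uint32_t[120] above");
+static_assert(sparse_gmma_c_regs(160) ==  80, "matches CRegisters = uint32_t[80]");
+#endif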
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=U8*S8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32U8S8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.u8.s8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, 
%97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107, %108, %109, %110, %111, "
+      " %112, %113, %114, %115, %116, %117, %118, %119},"
+      "{%120, %121, %122, %123},"
+      " %124,"
+      " %125, %126,"
+      " p;\n"
+    "}\n"
+      : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003),
+        "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007),
+        "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011),
+        "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015),
+        "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019),
+        "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023),
+        "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027),
+        "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031),
+        "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035),
+        "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039),
+        "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043),
+        "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047),
+        "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051),
+        "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055),
+        "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059),
+        "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063),
+        "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067),
+        "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071),
+        "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075),
+        "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079),
+        "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083),
+        "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087),
+        "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091),
+        "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095),
+        "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099),
+        "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103),
+        "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107),
+        "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111),
+        "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115),
+        "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8S8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x24x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x24x64_S32U8U8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[12];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %16, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.u8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11},"
+      " %12,"
+      " %13,"
+      " %14, %15,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
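+// Note: a minimal usage sketch for the SS atom above (illustration only,
+// compiled out; not part of this change). Both operands arrive as 64-bit
+// shared-memory matrix descriptors, the accumulators live in registers, and
+// `e` carries the 2:4 sparsity metadata for A. Descriptor construction and
+// the surrounding wgmma fence/commit/wait protocol are deliberately elided;
+// the function name is hypothetical.
+#if 0
+__device__ void sparse_s32_u8u8_n24_step(uint64_t desc_a, uint64_t desc_b,
+                                         uint32_t (&acc)[12], uint32_t e)
+{
+  // acc += A * B; passing GMMA::ScaleOut::Zero instead would overwrite acc.
+  GMMA_64x24x64_S32U8U8_SS_TN<GMMA::SparseSel::Zero>::fma(
+      desc_a, desc_b,
+      acc[0], acc[1], acc[2], acc[3], acc[4],  acc[5],
+      acc[6], acc[7], acc[8], acc[9], acc[10], acc[11],
+      e, GMMA::ScaleOut::One);
+}
+#endif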
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x24x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x24x64_S32U8U8_SS_TN_SATURATE
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[12];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %16, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.u8.u8.satfinite "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11},"
+      " %12,"
+      " %13,"
+      " %14, %15,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x48x64 TN S32+=U8*U8
+template <
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x48x64_S32U8U8_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[24];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %28, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.u8.u8 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23},"
+      " %24,"
+      " %25,"
+      " %26, %27,"
+      " p;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
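+// Note: every shape in this file comes as a pair. The _SATURATE variant
+// differs only in the `.satfinite` qualifier, which (per the PTX ISA) clamps
+// the s32 accumulation to the limits of the destination type instead of
+// wrapping on overflow. Both variants lower scale_D to the predicate `p` in
+// the same way:
+//
+//   setp.ne.b32 p, %<last>, 0;     // p = (scale_D != 0)
+//   wgmma.mma_async.sp....., p;    // p==1: D = A*B + D;  p==0: D = A*B
+//
+// and both fold the `spsel` template parameter in as the immediate "n"
+// operand, the PTX sparsity selector that governs how the metadata in `e`
+// is sourced.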
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, 
%23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, 
uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, 
uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & 
d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, 
+ uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, 
uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & 
d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 
64x176x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), 
+ "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, 
%82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & 
d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& 
desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + 
"+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, 
uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32U8U8_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, 
uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + 
"+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8U8_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32U8U8_SS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, 
uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8U8_SS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, 
uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & 
d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; 
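+  // Each of the warpgroup's 128 threads holds N/2 = 40 packed S32 accumulator values for the 64x80 tile (64*80/128 = 40).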
+ using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), 
"+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + 
"+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), 
"+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), 
"+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), 
"+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, 
%69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + 
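+    // Editorial note, not part of the upstream source: as in every atom in this file,
+    // the PTX below first materializes the runtime scale_D argument as predicate p
+    // ("setp.ne.b32 p, %95, 0"), so the instruction computes D = A*B when scale_D is
+    // Zero and D = A*B + D when it is One.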
"{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & 
d61, uint32_t & d62, uint32_t & d63, + uint32_t & d64, uint32_t & d65, uint32_t & d66, uint32_t & d67, + uint32_t & d68, uint32_t & d69, uint32_t & d70, uint32_t & d71, + uint32_t & d72, uint32_t & d73, uint32_t & d74, uint32_t & d75, + uint32_t & d76, uint32_t & d77, uint32_t & d78, uint32_t & d79, + uint32_t & d80, uint32_t & d81, uint32_t & d82, uint32_t & d83, + uint32_t & d84, uint32_t & d85, uint32_t & d86, uint32_t & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61), "+r"(d62), "+r"(d63), + "+r"(d64), "+r"(d65), "+r"(d66), "+r"(d67), + "+r"(d68), "+r"(d69), "+r"(d70), "+r"(d71), + "+r"(d72), "+r"(d73), "+r"(d74), "+r"(d75), + "+r"(d76), "+r"(d77), "+r"(d78), "+r"(d79), + "+r"(d80), "+r"(d81), "+r"(d82), "+r"(d83), + "+r"(d84), "+r"(d85), "+r"(d86), "+r"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & 
d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + 
"+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, 
" + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, 
uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8U8_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, 
%111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32U8U8_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, 
uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.u8.u8 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + 
"r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8U8_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN S32+=U8*U8 +template < + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_S32U8U8_RS_TN_SATURATE +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + uint32_t & d000, uint32_t & d001, uint32_t & d002, uint32_t & d003, + uint32_t & d004, uint32_t & d005, uint32_t & d006, uint32_t & d007, + uint32_t & d008, uint32_t & d009, uint32_t & d010, uint32_t & d011, + uint32_t & d012, uint32_t & d013, uint32_t & d014, uint32_t & d015, + uint32_t & d016, uint32_t & d017, uint32_t & d018, uint32_t & d019, + uint32_t & d020, uint32_t & d021, uint32_t & d022, uint32_t & d023, + uint32_t & d024, uint32_t & d025, uint32_t & d026, uint32_t & d027, + uint32_t & d028, uint32_t & d029, uint32_t & d030, uint32_t & d031, + uint32_t & d032, uint32_t & d033, uint32_t & d034, uint32_t & d035, + uint32_t & d036, uint32_t & d037, uint32_t & d038, uint32_t & d039, + uint32_t & d040, uint32_t & d041, uint32_t & d042, uint32_t & d043, + uint32_t & d044, uint32_t & d045, uint32_t & d046, uint32_t & d047, + uint32_t & d048, uint32_t & d049, uint32_t & d050, uint32_t & d051, + uint32_t & d052, uint32_t & d053, uint32_t & d054, uint32_t & d055, + uint32_t & d056, uint32_t & d057, uint32_t & d058, uint32_t & d059, + uint32_t & d060, uint32_t & d061, uint32_t & d062, uint32_t & d063, + uint32_t & d064, uint32_t & d065, uint32_t & d066, uint32_t & d067, + uint32_t & d068, uint32_t & d069, uint32_t & d070, uint32_t & d071, + uint32_t & d072, uint32_t & d073, uint32_t & d074, uint32_t & d075, + uint32_t & d076, uint32_t & d077, uint32_t & d078, uint32_t & d079, + uint32_t & d080, uint32_t & d081, uint32_t & d082, uint32_t & d083, + uint32_t & d084, uint32_t & d085, uint32_t & d086, uint32_t & d087, + uint32_t & d088, uint32_t & d089, uint32_t & d090, uint32_t & d091, + uint32_t & d092, uint32_t & d093, uint32_t & d094, uint32_t & d095, + uint32_t & d096, uint32_t & d097, uint32_t & d098, uint32_t & d099, + uint32_t & d100, uint32_t & d101, uint32_t & d102, uint32_t & d103, + uint32_t & d104, uint32_t & d105, uint32_t & d106, uint32_t & d107, + uint32_t & d108, uint32_t & d109, uint32_t & d110, uint32_t & d111, + uint32_t & d112, uint32_t & d113, uint32_t & d114, uint32_t & d115, + uint32_t & d116, uint32_t & d117, uint32_t & d118, uint32_t & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.s32.u8.u8.satfinite " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, 
%60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p;\n" + "}\n" + : "+r"(d000), "+r"(d001), "+r"(d002), "+r"(d003), + "+r"(d004), "+r"(d005), "+r"(d006), "+r"(d007), + "+r"(d008), "+r"(d009), "+r"(d010), "+r"(d011), + "+r"(d012), "+r"(d013), "+r"(d014), "+r"(d015), + "+r"(d016), "+r"(d017), "+r"(d018), "+r"(d019), + "+r"(d020), "+r"(d021), "+r"(d022), "+r"(d023), + "+r"(d024), "+r"(d025), "+r"(d026), "+r"(d027), + "+r"(d028), "+r"(d029), "+r"(d030), "+r"(d031), + "+r"(d032), "+r"(d033), "+r"(d034), "+r"(d035), + "+r"(d036), "+r"(d037), "+r"(d038), "+r"(d039), + "+r"(d040), "+r"(d041), "+r"(d042), "+r"(d043), + "+r"(d044), "+r"(d045), "+r"(d046), "+r"(d047), + "+r"(d048), "+r"(d049), "+r"(d050), "+r"(d051), + "+r"(d052), "+r"(d053), "+r"(d054), "+r"(d055), + "+r"(d056), "+r"(d057), "+r"(d058), "+r"(d059), + "+r"(d060), "+r"(d061), "+r"(d062), "+r"(d063), + "+r"(d064), "+r"(d065), "+r"(d066), "+r"(d067), + "+r"(d068), "+r"(d069), "+r"(d070), "+r"(d071), + "+r"(d072), "+r"(d073), "+r"(d074), "+r"(d075), + "+r"(d076), "+r"(d077), "+r"(d078), "+r"(d079), + "+r"(d080), "+r"(d081), "+r"(d082), "+r"(d083), + "+r"(d084), "+r"(d085), "+r"(d086), "+r"(d087), + "+r"(d088), "+r"(d089), "+r"(d090), "+r"(d091), + "+r"(d092), "+r"(d093), "+r"(d094), "+r"(d095), + "+r"(d096), "+r"(d097), "+r"(d098), "+r"(d099), + "+r"(d100), "+r"(d101), "+r"(d102), "+r"(d103), + "+r"(d104), "+r"(d105), "+r"(d106), "+r"(d107), + "+r"(d108), "+r"(d109), "+r"(d110), "+r"(d111), + "+r"(d112), "+r"(d113), "+r"(d114), "+r"(d115), + "+r"(d116), "+r"(d117), "+r"(d118), "+r"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8U8_RS_TN_SATURATE without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5}," + " %6," + " %7," + " %8, %9," + " p, %11, %12;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5}," + "{%6, %7, %8, %9}," + " %10," + " %11, %12," + " p, %14, %15;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, 
+ GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " %12, %13," + " p, %15, %16;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; 
+ using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " %15, %16," + " p, %18, %19;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F32E4M3E4M3_RS_TN +{ + using 
DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; 
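For orientation, every struct in this family follows the same shape: a register manifest (`ARegisters`/`BRegisters`/`CRegisters`/`ERegisters`), an `fma()` that wraps a single `wgmma.mma_async.sp` PTX instruction with `scale_D` lowered to the predicate `p`, and a compile-time guard on `CUTE_ARCH_MMA_SM90A_ENABLED`. Below is a minimal sketch of driving one of these atoms by hand, using the `GMMA_64x24x64_F16E4M3E4M3_SS_TN` atom defined earlier in this diff. It is illustrative only: the function name is hypothetical, and `desc_a`, `desc_b`, and `metadata` are placeholders for the shared-memory matrix descriptors and packed 2:4 sparsity selectors that the CuTe collective layers normally construct. Real kernels should go through `cute::gemm` and the collective mainloop, which also issue the warpgroup fence/commit/wait ordering required around `wgmma`.

// Sketch only — assumes sm_90a device code; descriptor/metadata construction
// and the wgmma fence/commit/wait protocol are elided and left to the caller.
#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
__device__ void sparse_gmma_atom_sketch(uint64_t desc_a,   // SMEM descriptor for A (assumed prebuilt)
                                        uint64_t desc_b,   // SMEM descriptor for B (assumed prebuilt)
                                        uint32_t metadata) // packed 2:4 sparsity metadata word (E)
{
  // Namespace qualification follows the error strings in this file.
  using Atom = cute::SM90::GMMA::SPARSE::GMMA_64x24x64_F16E4M3E4M3_SS_TN<>;
  uint32_t d[6] = {};  // CRegisters = uint32_t[6]: six packed f16x2 accumulators
  Atom::fma(desc_a, desc_b,
            d[0], d[1], d[2], d[3], d[4], d[5],
            metadata,
            cute::GMMA::ScaleOut::One);  // One: D += A*B into the zeroed accumulators
}
#endif

Passing `GMMA::ScaleOut::Zero` instead makes the instruction overwrite the accumulators (`D = A*B`); the `setp.ne.b32 p, ...` on the `scale_D` operand in the asm bodies above is what selects between the two behaviors.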
+ +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " %16, %17," + " p, %19, %20;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), 
"+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " %19, %20," + " p, %22, %23;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, 
%20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, 
+ uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " %20, %21," + " p, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " %23, %24," + " p, %26, %27;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n72k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + 
CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), 
"+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = 
uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + " %22," + " %23," + " %24, %25," + " p, %27, %28;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " %27, %28," + " p, %30, %31;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & 
d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " %28, %29," + " p, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), 
"+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + "{%26, %27, %28, %29}," + " %30," + " %31, %32," + " p, %34, %35;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float 
& d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + 
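Each tile shape comes in an `_SS_TN` and an `_RS_TN` flavor; the only difference is how A is sourced: `_SS` takes a 64-bit shared-memory matrix descriptor (`ARegisters = uint64_t[1]`), `_RS` takes the A fragment directly in four 32-bit registers (`ARegisters = uint32_t[4]`). B is always a descriptor, and the sparsity metadata `e` is always a single 32-bit register. A compile-time sketch of those operand classes (illustrative only; `SS_Atom`/`RS_Atom` are stand-in names, not types from this patch):

```cpp
// Illustrative mirror of the operand classes declared by the _SS/_RS atoms.
#include <cstdint>
#include <type_traits>

struct SS_Atom {                   // shape of e.g. GMMA_64x112x64_*_SS_TN
  using ARegisters = uint64_t[1];  // smem matrix descriptor for A
  using BRegisters = uint64_t[1];  // smem matrix descriptor for B
  using ERegisters = uint32_t[1];  // sparsity metadata
};
struct RS_Atom {                   // shape of e.g. GMMA_64x112x64_*_RS_TN
  using ARegisters = uint32_t[4];  // A fragment held in registers
  using BRegisters = uint64_t[1];
  using ERegisters = uint32_t[1];
};

static_assert(sizeof(SS_Atom::ARegisters) == 8,  "one 64-bit descriptor");
static_assert(sizeof(RS_Atom::ARegisters) == 16, "four 32-bit registers");
static_assert(std::is_same_v<SS_Atom::BRegisters, RS_Atom::BRegisters>,
              "B is descriptor-sourced in both flavors");

int main() { return 0; }
```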
"r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, 
float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + " %30," + " %31," + " %32, %33," + " p, %35, %36;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) 
+ : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + "{%30, %31, %32, %33}," + " %34," + " %35, %36," + " p, %38, %39;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + 
float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & 
d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, 
%27, %28, %29, %30, %31, " + " %32, %33}," + " %34," + " %35," + " %36, %37," + " p, %39, %40;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + "{%34, %35, %36, %37}," + " %38," + " %39, %40," + " p, %42, %43;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %72, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " %70, %71," + " p, %73, %74;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using 
ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %75, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " %73, %74," + " p, %76, %77;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = 
uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n144k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p, %77, %78;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), 
"+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p, %80, %81;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + 
"+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " %40, %41," + " p, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + "{%38, %39, %40, %41}," + " %42," + " %43, %44," + " p, %46, %47;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, 
float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %80, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " %78, %79," + " p, %81, %82;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, 
float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %83, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " %81, %82," + " p, %84, %85;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& 
desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
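    // RS variant: the A fragment is supplied in registers (a00..a03) while B is
+    // read through a shared-memory matrix descriptor (desc_b); 'e' carries the
+    // packed 2:4 sparsity metadata, predicate p gates accumulation via the
+    // runtime scale_D flag, and spsel/scaleA/scaleB are immediate operands.
+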
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " 
%48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p, %85, %86;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " 
+ " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p, %88, %89;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, 
%36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " %44, %45," + " p, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + "{%42, %43, %44, %45}," + " %46," + " %47, %48," + " p, %50, %51;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
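    // Guard path: when CUTE_ARCH_MMA_SM90A_ENABLED is not defined this atom has
+    // no wgmma lowering, so any attempted use is reported as an invalid control
+    // path rather than silently producing wrong results.
+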
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %88, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " %86, %87," + " p, %89, %90;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), 
"+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %91, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " %89, %90," + " p, %92, %93;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + 
"+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t 
const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p, %93, %94;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F32+=E4M3*E4M3 
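+//
+// Illustrative sketch only (not part of the generated atoms): assuming desc_a
+// and desc_b are valid GMMA shared-memory descriptors and 'meta' holds the
+// packed 2:4 sparsity metadata, the 64x176 SS_TN atom defined above would be
+// driven as
+//
+//   float d[88] = {};  // f32 accumulator fragment for the 64x176 tile
+//   SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E4M3_SS_TN<>::fma(
+//       desc_a, desc_b, d[0], d[1], /* ..., */ d[87], meta);
+//
+// In practice these atoms are expected to be invoked through CuTe's sparse MMA
+// traits and cute::gemm rather than called by hand.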
+template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p, %96, %97;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), 
"+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + " %46," + " %47," + " %48, %49," + " p, %51, %52;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x184x64_F16E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[46];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %53, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n184k64.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45},"
+      "{%46, %47, %48, %49},"
+      " %50,"
+      " %51, %52,"
+      " p, %54, %55;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x184x64 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x184x64_F32E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[92];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      float & d68, float & d69, float & d70, float & d71,
+      float & d72, float & d73, float & d74, float & d75,
+      float & d76, float & d77, float & d78, float & d79,
+      float & d80, float & d81, float & d82, float & d83,
+      float & d84, float & d85, float & d86, float & d87,
+      float & d88, float & d89, float & d90, float & d91,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %96, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n184k64.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91},"
+      " %92,"
+      " %93,"
+      " %94, %95,"
+      " p, %97, %98;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
+        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
+        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
+        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
+        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
+        "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87),
+        "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x184x64 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x184x64_F32E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[92];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      float & d68, float & d69, float & d70, float & d71,
+      float & d72, float & d73, float & d74, float & d75,
+      float & d76, float & d77, float & d78, float & d79,
+      float & d80, float & d81, float & d82, float & d83,
+      float & d84, float & d85, float & d86, float & d87,
+      float & d88, float & d89, float & d90, float & d91,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %99, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n184k64.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91},"
+      "{%92, %93, %94, %95},"
+      " %96,"
+      " %97, %98,"
+      " p, %100, %101;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
+        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71),
+        "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75),
+        "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79),
+        "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83),
+        "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87),
+        "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x200x64 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x200x64_F16E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[50];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %54, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n200k64.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49},"
+      " %50,"
+      " %51,"
+      " %52, %53,"
+      " p, %55, %56;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x200x64 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x200x64_F16E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[50];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %57, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n200k64.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49},"
+      "{%50, %51, %52, %53},"
+      " %54,"
+      " %55, %56,"
+      " p, %58, %59;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x200x64 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x200x64_F32E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[100];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %104, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n200k64.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99},"
+      " %100,"
+      " %101,"
+      " %102, %103,"
+      " p, %105, %106;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x200x64 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x200x64_F32E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[100];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %107, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n200k64.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99},"
+      "{%100, %101, %102, %103},"
+      " %104,"
+      " %105, %106,"
+      " p, %108, %109;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099)
+      : "r"(a000), "r"(a001), "r"(a002), "r"(a003),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x208x64 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x208x64_F16E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[52];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %56, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n208k64.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51},"
+      " %52,"
+      " %53,"
+      " %54, %55,"
+      " p, %57, %58;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
"+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + 
"+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p, %109, %110;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), 
"+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + 
float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p, %112, %113;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, 
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %58, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n216k64.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53},"
+      " %54,"
+      " %55,"
+      " %56, %57,"
+      " p, %59, %60;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x216x64 TN F16+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x216x64_F16E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[54];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39,
+      uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43,
+      uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47,
+      uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51,
+      uint32_t & d52, uint32_t & d53,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %61, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n216k64.f16.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53},"
+      "{%54, %55, %56, %57},"
+      " %58,"
+      " %59, %60,"
+      " p, %62, %63;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35),
+        "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39),
+        "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43),
+        "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47),
+        "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51),
+        "+r"(d52), "+r"(d53)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x216x64 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x216x64_F32E4M3E4M3_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[108];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      float & d100, float & d101, float & d102, float & d103,
+      float & d104, float & d105, float & d106, float & d107,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %112, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n216k64.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107},"
+      " %108,"
+      " %109,"
+      " %110, %111,"
+      " p, %113, %114;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
+        "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047),
+        "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051),
+        "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055),
+        "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059),
+        "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063),
+        "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067),
+        "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071),
+        "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075),
+        "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079),
+        "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083),
+        "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087),
+        "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091),
+        "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095),
+        "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099),
+        "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103),
+        "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x216x64 TN F32+=E4M3*E4M3
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x216x64_F32E4M3E4M3_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[108];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003,
+      uint64_t const& desc_b,
+      float & d000, float & d001, float & d002, float & d003,
+      float & d004, float & d005, float & d006, float & d007,
+      float & d008, float & d009, float & d010, float & d011,
+      float & d012, float & d013, float & d014, float & d015,
+      float & d016, float & d017, float & d018, float & d019,
+      float & d020, float & d021, float & d022, float & d023,
+      float & d024, float & d025, float & d026, float & d027,
+      float & d028, float & d029, float & d030, float & d031,
+      float & d032, float & d033, float & d034, float & d035,
+      float & d036, float & d037, float & d038, float & d039,
+      float & d040, float & d041, float & d042, float & d043,
+      float & d044, float & d045, float & d046, float & d047,
+      float & d048, float & d049, float & d050, float & d051,
+      float & d052, float & d053, float & d054, float & d055,
+      float & d056, float & d057, float & d058, float & d059,
+      float & d060, float & d061, float & d062, float & d063,
+      float & d064, float & d065, float & d066, float & d067,
+      float & d068, float & d069, float & d070, float & d071,
+      float & d072, float & d073, float & d074, float & d075,
+      float & d076, float & d077, float & d078, float & d079,
+      float & d080, float & d081, float & d082, float & d083,
+      float & d084, float & d085, float & d086, float & d087,
+      float & d088, float & d089, float & d090, float & d091,
+      float & d092, float & d093, float & d094, float & d095,
+      float & d096, float & d097, float & d098, float & d099,
+      float & d100, float & d101, float & d102, float & d103,
+      float & d104, float & d105, float & d106, float & d107,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %115, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n216k64.f32.e4m3.e4m3 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71, "
+      " %72, %73, %74, %75, %76, %77, %78, %79, "
+      " %80, %81, %82, %83, %84, %85, %86, %87, "
+      " %88, %89, %90, %91, %92, %93, %94, %95, "
+      " %96, %97, %98, %99, %100, %101, %102, %103, "
+      " %104, %105, %106, %107},"
+      "{%108, %109, %110, %111},"
+      " %112,"
+      " %113, %114,"
+      " p, %116, %117;\n"
+    "}\n"
+      : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003),
+        "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007),
+        "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011),
+        "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015),
+        "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019),
+        "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023),
+        "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027),
+        "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031),
+        "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035),
+        "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039),
+        "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043),
"+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), 
"+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), 
"+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, 
%98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p, %117, %118;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & 
d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p, %120, %121;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " %60, %61," + " p, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = 
uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " %63, %64," + " p, %66, %67;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + 
float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %120, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " %118, %119," + " p, %121, %122;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), 
"+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %123, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, 
%12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " %121, %122," + " p, %124, %125;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t 
& d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + 
uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float 
& d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p, %125, %126;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), 
"n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, 
%108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p, %128, %129;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F16E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, 
uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + " %62," + " %63," + " %64, %65," + " p, %67, %68;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F16+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F16E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f16.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " %67, %68," + " p, %70, %71;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F32E4M3E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float 
& d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %128, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " %126, %127," + " p, %129, %130;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
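[Editorial note, not part of the upstream diff] The struct above completes the 64x248x64 F32+=E4M3*E4M3 SS atom; its RS twin and the E4M3*E5M2 shapes follow. All atoms in this family share one pattern: the accumulators are read-modify-write ("+f"/"+r") registers, A arrives either as a shared-memory matrix descriptor (_SS_) or as four 32-bit registers (_RS_), B is always a descriptor, `e` carries the 2:4 sparsity metadata selected by `spsel`, and the predicate `p` clears the accumulator when `scale_D` is `ScaleOut::Zero`. The sketch below is a hedged illustration of a raw call to the 64x24x64 F32+=E4M3*E5M2 SS atom defined a few hunks further down; it assumes an sm_90a target, operand descriptors already encoded elsewhere, and that this new header lands at cute/arch/mma_sm90_gmma_sparse.hpp. Production code reaches these ops through cute::MMA_Atom and the matching MMA_Traits rather than calling fma() directly.

#include <cute/arch/mma_sm90_gmma.hpp>         // warpgroup_arrive/commit/wait helpers
#include <cute/arch/mma_sm90_gmma_sparse.hpp>  // assumed path of the header this diff adds

__device__ void sparse_wgmma_sketch(uint64_t desc_a, uint64_t desc_b,
                                    uint32_t e, float (&acc)[12])
{
  using namespace cute;
  // Fence so prior register/smem writes are visible to the async proxy (wgmma.fence).
  warpgroup_arrive();
  // D[64x24] += sparse(A[64x64, e4m3]) * B[64x24, e5m2]; 64*24/128 threads = 12 f32 each.
  SM90::GMMA::SPARSE::GMMA_64x24x64_F32E4M3E5M2_SS_TN<>::fma(
      desc_a, desc_b,
      acc[0], acc[1], acc[2], acc[3], acc[4],  acc[5],
      acc[6], acc[7], acc[8], acc[9], acc[10], acc[11],
      e, GMMA::ScaleOut::One);  // ScaleOut::Zero would zero the accumulator first
  // Commit the async MMA batch and wait for it to retire before reading acc.
  warpgroup_commit_batch();
  warpgroup_wait<0>();
}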
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F32+=E4M3*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F32E4M3E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %131, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f32.e4m3.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " 
%129, %130," + " p, %132, %133;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5}," + " %6," + " %7," + " %8, %9," + " p, %11, %12;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F16+=E4M3*E5M2 +template < + 
GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5}," + "{%6, %7, %8, %9}," + " %10," + " %11, %12," + " p, %14, %15;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + 
CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " %12, %13," + " p, %15, %16;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, 
uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " %15, %16," + " p, %18, %19;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + 
float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + 
fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x48x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " %16, %17," + " p, %19, %20;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
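All of these generated atoms share one calling convention: the D accumulators live in registers (`CRegisters`), A comes either from a shared-memory matrix descriptor (`_SS_` variants, `ARegisters = uint64_t[1]`) or from registers (`_RS_` variants, `ARegisters = uint32_t[4]`), B always comes through a descriptor, `e` carries the packed 2:4 sparsity metadata, and `scale_D` is lowered to the predicate `p` by `setp.ne.b32`, so `ScaleOut::Zero` overwrites the accumulators while `ScaleOut::One` accumulates into them. Below is a minimal usage sketch for the `GMMA_64x56x64_F16E4M3E5M2_SS_TN` atom just defined, assuming the descriptors, the metadata, and the surrounding wgmma fence/commit/wait protocol are supplied by the caller, as in the CuTe collectives that consume these atoms:

```cpp
// A minimal sketch, not a drop-in kernel: it assumes SM90a compilation, that
// this header lands at cute/arch/mma_sm90_gmma_sparse.hpp (path assumed), and
// that desc_a, desc_b, and e are built by the surrounding kernel (CuTe's smem
// descriptor and sparse-metadata utilities normally do this).
#include <cstdint>
#include <cute/arch/mma_sm90_gmma_sparse.hpp>

__device__ void sparse_gmma_64x56x64_f16_sketch(
    uint64_t desc_a,    // smem descriptor for the 2:4-compressed A tile
    uint64_t desc_b,    // smem descriptor for the dense B tile
    uint32_t e,         // packed 2:4 sparsity metadata
    uint32_t (&d)[14])  // this thread's 28 f16 accumulators, 2 per uint32_t
{
  // Template defaults: scaleA = scaleB = ScaleIn::One, spsel = SparseSel::Zero.
  using Atom = cute::SM90::GMMA::SPARSE::GMMA_64x56x64_F16E4M3E5M2_SS_TN<>;
  // ScaleOut::One accumulates into d; ScaleOut::Zero makes the wgmma overwrite
  // it (the inline PTX turns this argument into predicate p via setp.ne.b32).
  Atom::fma(desc_a, desc_b,
            d[0], d[1], d[2],  d[3],  d[4],  d[5],  d[6],
            d[7], d[8], d[9],  d[10], d[11], d[12], d[13],
            e, cute::GMMA::ScaleOut::One);
}
```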
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " %19, %20," + " p, %22, %23;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + 
"+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, 
+ uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " %20, %21," + " p, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " %23, %24," + " p, %26, %27;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t 
const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45;\n" + "}\n" + : "+f"(d00), 
"+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, 
uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t 
& d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + " %22," + " %23," + " %24, %25," + " p, %27, %28;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " %27, %28," + " p, %30, %31;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn 
scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float 
& d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " %28, %29," + " p, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + "{%26, %27, %28, %29}," + " %30," + " %31, %32," + " p, %34, %35;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, 
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x104x64 TN F32+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x104x64_F32E4M3E5M2_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[52];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %56, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n104k64.f32.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51},"
+      " %52,"
+      " %53,"
+      " %54, %55,"
+      " p, %57, %58;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x104x64 TN F32+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x104x64_F32E4M3E5M2_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[52];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %59, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n104k64.f32.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51},"
+      "{%52, %53, %54, %55},"
+      " %56,"
+      " %57, %58,"
+      " p, %60, %61;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x112x64 TN F16+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x112x64_F16E4M3E5M2_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[28];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %32, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n112k64.f16.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27},"
+      " %28,"
+      " %29,"
+      " %30, %31,"
+      " p, %33, %34;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x112x64 TN F16+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x112x64_F16E4M3E5M2_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[28];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %35, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n112k64.f16.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27},"
+      "{%28, %29, %30, %31},"
+      " %32,"
+      " %33, %34,"
+      " p, %36, %37;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x112x64 TN F32+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x112x64_F32E4M3E5M2_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[56];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %60, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n112k64.f32.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55},"
+      " %56,"
+      " %57,"
+      " %58, %59,"
+      " p, %61, %62;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x112x64 TN F32+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x112x64_F32E4M3E5M2_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[56];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %63, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n112k64.f32.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55},"
+      "{%56, %57, %58, %59},"
+      " %60,"
+      " %61, %62,"
+      " p, %64, %65;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x120x64 TN F16+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x120x64_F16E4M3E5M2_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[30];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %34, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n120k64.f16.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29},"
+      " %30,"
+      " %31,"
+      " %32, %33,"
+      " p, %35, %36;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
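+// Accumulator sizing follows directly from the shape: a 64xNx64 warpgroup MMA
+// distributes 64*N accumulator elements over the 128 threads of the warpgroup,
+// i.e. N/2 elements per thread. F32 atoms therefore declare
+// CRegisters = float[N/2], while F16 atoms pack two halves per register and
+// declare CRegisters = uint32_t[N/4] (for N = 120 above: float[60] versus
+// uint32_t[30]).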
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x120x64 TN F16+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x120x64_F16E4M3E5M2_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[30];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %37, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n120k64.f16.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29},"
+      "{%30, %31, %32, %33},"
+      " %34,"
+      " %35, %36,"
+      " p, %38, %39;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x120x64 TN F32+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x120x64_F32E4M3E5M2_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[60];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %64, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n120k64.f32.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59},"
+      " %60,"
+      " %61,"
+      " %62, %63,"
+      " p, %65, %66;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x120x64 TN F32+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x120x64_F32E4M3E5M2_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[60];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %67, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n120k64.f32.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59},"
+      "{%60, %61, %62, %63},"
+      " %64,"
+      " %65, %66,"
+      " p, %68, %69;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
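+// _SS atoms source both A and B from shared memory through 64-bit matrix
+// descriptors, while _RS atoms consume the A fragment from registers
+// (ARegisters = uint32_t[4]); only B keeps a descriptor. The synclog hooks
+// mirror this split: synclog_emit_wgmma_smem_smem records both descriptors,
+// synclog_emit_wgmma_reg_smem records only desc_b.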
"+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + "{%34, %35, %36, %37}," + " %38," + " %39, %40," + " p, %42, %43;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + 
CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %72, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " %70, %71," + " p, %73, %74;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, 
float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %75, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " %73, %74," + " p, %76, %77;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, 
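+// These fma overloads are not meant to be called by hand. In CuTe each struct
+// is wrapped in an MMA_Atom whose traits unpack the ARegisters / BRegisters /
+// CRegisters / ERegisters arrays into this flat argument list (see the
+// mma_unpack machinery under cute/atom); spelling out all N/2 or N/4
+// accumulator references explicitly is left to that layer.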
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x144x64 TN F16+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x144x64_F16E4M3E5M2_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[36];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %40, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n144k64.f16.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35},"
+      " %36,"
+      " %37,"
+      " %38, %39,"
+      " p, %41, %42;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x144x64 TN F16+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x144x64_F16E4M3E5M2_RS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint32_t[4];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = uint32_t[36];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03,
+      uint64_t const& desc_b,
+      uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03,
+      uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07,
+      uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11,
+      uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15,
+      uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19,
+      uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23,
+      uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27,
+      uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31,
+      uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %43, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n144k64.f16.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35},"
+      "{%36, %37, %38, %39},"
+      " %40,"
+      " %41, %42,"
+      " p, %44, %45;\n"
+    "}\n"
+      : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03),
+        "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07),
+        "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11),
+        "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15),
+        "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19),
+        "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23),
+        "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27),
+        "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31),
+        "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35)
+      : "r"(a00), "r"(a01), "r"(a02), "r"(a03),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+// SPARSE GMMA 64x144x64 TN F32+=E4M3*E5M2
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One,
+  GMMA::SparseSel spsel = GMMA::SparseSel::Zero
+>
+struct GMMA_64x144x64_F32E4M3E5M2_SS_TN
+{
+  using DRegisters = void;
+  using ARegisters = uint64_t[1];
+  using ERegisters = uint32_t[1];
+  using BRegisters = uint64_t[1];
+  using CRegisters = float[72];
+
+  CUTE_HOST_DEVICE static void
+  fma(uint64_t const& desc_a,
+      uint64_t const& desc_b,
+      float & d00, float & d01, float & d02, float & d03,
+      float & d04, float & d05, float & d06, float & d07,
+      float & d08, float & d09, float & d10, float & d11,
+      float & d12, float & d13, float & d14, float & d15,
+      float & d16, float & d17, float & d18, float & d19,
+      float & d20, float & d21, float & d22, float & d23,
+      float & d24, float & d25, float & d26, float & d27,
+      float & d28, float & d29, float & d30, float & d31,
+      float & d32, float & d33, float & d34, float & d35,
+      float & d36, float & d37, float & d38, float & d39,
+      float & d40, float & d41, float & d42, float & d43,
+      float & d44, float & d45, float & d46, float & d47,
+      float & d48, float & d49, float & d50, float & d51,
+      float & d52, float & d53, float & d54, float & d55,
+      float & d56, float & d57, float & d58, float & d59,
+      float & d60, float & d61, float & d62, float & d63,
+      float & d64, float & d65, float & d66, float & d67,
+      float & d68, float & d69, float & d70, float & d71,
+      uint32_t const& e,
+      GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One)
+  {
+#if defined(CUTE_ARCH_MMA_SM90A_ENABLED)
+    cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b);
+    asm volatile(
+    "{\n"
+      ".reg .pred p;\n"
+      "setp.ne.b32 p, %76, 0;\n"
+      "wgmma.mma_async.sp.sync.aligned.m64n144k64.f32.e4m3.e5m2 "
+      "{%0, %1, %2, %3, %4, %5, %6, %7, "
+      " %8, %9, %10, %11, %12, %13, %14, %15, "
+      " %16, %17, %18, %19, %20, %21, %22, %23, "
+      " %24, %25, %26, %27, %28, %29, %30, %31, "
+      " %32, %33, %34, %35, %36, %37, %38, %39, "
+      " %40, %41, %42, %43, %44, %45, %46, %47, "
+      " %48, %49, %50, %51, %52, %53, %54, %55, "
+      " %56, %57, %58, %59, %60, %61, %62, %63, "
+      " %64, %65, %66, %67, %68, %69, %70, %71},"
+      " %72,"
+      " %73,"
+      " %74, %75,"
+      " p, %77, %78;\n"
+    "}\n"
+      : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03),
+        "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07),
+        "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11),
+        "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15),
+        "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19),
+        "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23),
+        "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27),
+        "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31),
+        "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35),
+        "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39),
+        "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43),
+        "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47),
+        "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51),
+        "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55),
+        "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59),
+        "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63),
+        "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67),
+        "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71)
+      : "l"(desc_a),
+        "l"(desc_b),
+        "r"(e), "n"(int32_t(spsel)),
+        "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB)));
+#else
+    CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED");
+#endif
+  }
+};
+
"+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " %40, %41," + " p, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + 
GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + "{%38, %39, %40, %41}," + " %42," + " %43, %44," + " p, %46, %47;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & 
d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %80, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " %78, %79," + " p, %81, %82;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & 
d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %83, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " %81, %82," + " p, %84, %85;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, 
+ uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, 
%39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p, %85, %86;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), 
"+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, 
%79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p, %88, %89;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " %44, %45," + " p, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), 
"+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + "{%42, %43, %44, %45}," + " %46," + " %47, %48," + " p, %50, %51;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %88, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " %86, %87," + " p, %89, %90;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E5M2_SS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %91, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " %89, %90," + " p, %92, %93;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + 
"+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x176x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, 
float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p, %93, %94;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t 
const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p, %96, %97;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
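+// Editorial note on the calling convention shared by the sparse GMMA atoms in
+// this file (a summary of the code above, not new behavior):
+//   * _SS_ variants take A and B as shared-memory matrix descriptors
+//     (uint64_t desc_a, desc_b); _RS_ variants source A from four uint32_t
+//     registers (a00..a03) and keep B as a descriptor.
+//   * d00..dNN is the accumulator fragment owned by each thread of the
+//     calling warpgroup; for F16 accumulation each uint32_t packs two halves,
+//     so an m64nNk64 shape carries N/4 uint32_t (or N/2 float) registers.
+//   * e holds the 2:4 sparsity metadata word, the SparseSel template
+//     parameter selects the metadata slice, and scale_D == ScaleOut::Zero
+//     clears the accumulator via the predicate p computed by setp.ne.b32.
+//
+// A minimal usage sketch (illustrative only; in practice these atoms are
+// dispatched through cute::MMA_Atom and the SM90 collective mainloops, and
+// `acc`, `desc_a`, `desc_b`, and `e` below are assumed to have been prepared
+// by the caller):
+//
+//   float acc[76];                                 // 64x152 F32 fragment
+//   GMMA_64x152x64_F32E4M3E5M2_SS_TN<>::fma(
+//       desc_a, desc_b,
+//       acc[0], acc[1], /* ... */ acc[75],         // expanded element-wise
+//       e, GMMA::ScaleOut::Zero);                  // first k-block: D = A*B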
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + " %46," + " %47," + " %48, %49," + " p, %51, %52;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & 
d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + "{%46, %47, %48, %49}," + " %50," + " %51, %52," + " p, %54, %55;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & 
d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %96, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " %94, %95," + " p, %97, %98;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, 
float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %99, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " %97, %98," + " p, %100, %101;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN 
F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + " %50," + " %51," + " %52, %53," + " p, %55, %56;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, 
uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + "{%50, %51, %52, %53}," + " %54," + " %55, %56," + " p, %58, %59;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, 
float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %104, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " %102, %103," + " p, %105, %106;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %107, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " %105, %106," + " p, %108, %109;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), 
+ "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), 
"+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p, %109, %110;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + 
"+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + 
"setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p, %112, %113;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & 
d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + " %54," + " %55," + " %56, %57," + " p, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + uint32_t const& e, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + "{%54, %55, %56, %57}," + " %58," + " %59, %60," + " p, %62, %63;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, 
float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %112, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " %110, %111," + " p, %113, %114;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & 
d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %115, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " %113, %114," + " p, %116, %117;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), 
"+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), 
"+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E5M2_RS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p, %117, %118;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), 
"+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, 
float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p, %120, %121;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t 
const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " %60, %61," + " p, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, 
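+      /* d00..d57: 58 b32 registers per thread, each packing two f16
+         accumulators; the 64x232 accumulator tile is spread over the
+         128-thread warpgroup */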
uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " %63, %64," + " p, %66, %67;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, 
float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %120, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " %118, %119," + " p, %121, %122;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + 
"+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %123, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, 
" + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " %121, %122," + " p, %124, %125;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + 
uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = 
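+      /* ScaleOut::One accumulates (D = D + A*B); ScaleOut::Zero makes the
+         first k-block overwrite D instead */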
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & 
d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p, %125, %126;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F32+=E4M3*E5M2 +template < + 
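+  /* scaleA/scaleB feed the wgmma immediate input-scale operands
+     (GMMA::ScaleIn); spsel is the PTX sparsity selector that picks which
+     threads supply the 2:4 metadata */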
GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p, %128, %129;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + 
"+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F16E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %66, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n248k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + " %62," + " %63," + " %64, %65," + " p, %67, %68;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F16+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F16E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f16.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, 
%14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " %67, %68," + " p, %70, %71;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F32E4M3E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & 
d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %128, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " %126, %127," + " p, %129, %130;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F32+=E4M3*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
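+ /* RS: operand A is sourced from registers, B via SMEM descriptor;
+    the _SS_ variants source both operands through descriptors */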
GMMA_64x248x64_F32E4M3E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %131, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f32.e4m3.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " %129, %130," + " p, %132, %133;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), 
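+      /* "+f": each accumulator element is a read-write 32-bit float
+         operand, keeping the whole fragment live in registers across
+         the asm */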
"+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5}," + " %6," + " %7," + " %8, %9," + " p, %11, %12;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + 
using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5}," + "{%6, %7, %8, %9}," + " %10," + " %11, %12," + " p, %14, %15;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& 
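+      /* e: packed 2:4 sparsity metadata word (ERegisters). Illustrative
+         direct use of one of these atoms; in practice they are reached
+         through cute::MMA_Atom, and the wgmma fence/commit/wait protocol
+         still applies. desc_a, desc_b, and e are assumed in scope:
+           float d[12] = {};  // f32 accumulator fragment (CRegisters)
+           GMMA_64x24x64_F32E5M2E4M3_SS_TN<>::fma(desc_a, desc_b,
+               d[0], d[1], d[2], d[3], d[4],  d[5],
+               d[6], d[7], d[8], d[9], d[10], d[11],
+               e, GMMA::ScaleOut::Zero);  // first k-block: D = A*B */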
e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " %12, %13," + " p, %15, %16;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + 
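+      /* p is a predicate register local to this asm block: the setp below
+         converts the runtime scale_D integer into p, and wgmma consumes
+         p as its scale-d operand, so the prior accumulator is read only
+         when scale_D != 0 */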
".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " %15, %16," + " p, %18, %19;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = 
GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t 
const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + 
uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " %16, %17," + " p, %19, %20;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x56x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " %19, %20," + " p, %22, %23;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting 
to use SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + 
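+      // The .sp form consumes 2:4 structured-sparse A: of the 64 logical K
+      // elements only the nonzero half is stored, with the metadata operand
+      // (%20 here) indexing the nonzero positions.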
"wgmma.mma_async.sp.sync.aligned.m64n72k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " %20, %21," + " p, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " %23, %24," + " p, %26, %27;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & 
d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + 
"+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE 
GMMA 64x80x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if 
defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + " %22," + " %23," + " %24, %25," + " p, %27, %28;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " %27, %28," + " p, %30, %31;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = 
uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
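+    // In the RS form A is already in registers (a00..a03), so only the B
+    // descriptor is available to log; the SS forms log both desc_a and desc_b.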
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " %28, %29," + " p, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
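+// These atoms follow one pattern: _SS_TN variants read both A and B through 64-bit
+// shared-memory matrix descriptors, while _RS_TN variants take the A fragment in
+// four 32-bit registers. `e` carries the packed 2:4 sparse metadata for the K=64
+// slice, `spsel` is the PTX sparsity selector choosing which threads supply that
+// metadata, and scaleA/scaleB (GMMA::ScaleIn) can negate an input operand.
+//
+// Minimal usage sketch (illustrative only: `make_desc_a/b` and `meta` are
+// hypothetical stand-ins, and real kernels drive these atoms through cute::MMA_Atom
+// and the warpgroup fence/commit/wait ops rather than calling fma() directly):
+//
+//   uint64_t desc_a  = make_desc_a(smem_A);  // GMMA shared-memory descriptor for A
+//   uint64_t desc_b  = make_desc_b(smem_B);  // GMMA shared-memory descriptor for B
+//   uint32_t meta    = ...;                  // 2:4 metadata fragment for this slice
+//   uint32_t acc[26] = {};                   // 26 regs = 52 packed f16 accumulators
+//   SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E4M3_SS_TN<>::fma(
+//       desc_a, desc_b,
+//       acc[0], acc[1], /* ... */ acc[25],   // all 26 spelled out in real code
+//       meta, GMMA::ScaleOut::Zero);         // Zero on the first MMA: D = A*B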
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + "{%26, %27, %28, %29}," + " %30," + " %31, %32," + " p, %34, %35;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + 
uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + 
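+      // Operand map for this RS atom: %0-%51 f32 accumulators, %52-%55 A fragment,
+      // %56 B descriptor, %57 metadata e, %58 sparsity selector, p from %59
+      // (scale_D), %60/%61 scaleA/scaleB immediates.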
"{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct 
GMMA_64x112x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, 
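+      // (synclog_emit_* records this WGMMA issue site for the synclog
+      //  debugging facility; the calls are expected to compile to nothing
+      //  when synclog is not enabled.)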
desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," 
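+    // Trailing operands: %61 = sparse metadata e, %62 = spsel immediate,
+    // p = scale-D predicate (set from %63 above), %64/%65 = scaleA/scaleB.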
+ " p, %64, %65;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + " %30," + " %31," + " %32, %33," + " p, %35, %36;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + 
GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + "{%30, %31, %32, %33}," + " %34," + " %35, %36," + " p, %38, %39;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t 
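+      // e: packed 2:4 sparsity metadata for the A operand, one 32-bit word
+      // per thread (matching ERegisters = uint32_t[1]).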
const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, 
%12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + " %34," + " %35," + " %36, %37," + " p, %39, %40;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "l"(desc_a), + "l"(desc_b), + "r"(e), 
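+      // Inline-asm constraint letters: "r" = 32-bit register, "l" = 64-bit
+      // register (the SMEM descriptor), "n" = compile-time integer immediate.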
"n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + "{%34, %35, %36, %37}," + " %38," + " %39, %40," + " p, %42, %43;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, 
float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %72, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " %70, %71," + " p, %73, %74;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, 
float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %75, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " %73, %74," + " p, %76, %77;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, 
uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), 
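+      // "+r": read-write 32-bit registers, each packing two FP16 accumulator
+      // halves (36 registers = 72 halves = the 64x144/128 per-thread share).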
"+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p, %77, %78;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + 
"+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p, %80, %81;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), 
+ "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " %40, %41," + " p, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE 
static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + "{%38, %39, %40, %41}," + " %42," + " %43, %44," + " p, %46, %47;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float 
& d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %80, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " %78, %79," + " p, %81, %82;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float 
& d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %83, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " %81, %82," + " p, %84, %85;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, 
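+      // k = 64 here versus k = 32 for the corresponding dense FP8 GMMA:
+      // A is stored 2:4-compressed, so one instruction covers twice the
+      // K extent of the dense op.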
uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), 
"+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p, %85, %86;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), 
"+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p, %88, %89;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), 
"+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " %44, %45," + " p, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), 
"n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + "{%42, %43, %44, %45}," + " %46," + " %47, %48," + " p, %50, %51;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + 
fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %88, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " %86, %87," + " p, %89, %90;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %91, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " %89, %90," + " p, %92, %93;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
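+      // "n": scaleA and scaleB are compile-time template parameters, so they are
+      // lowered to PTX immediates rather than occupying registers.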
"n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t 
const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & 
d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p, %93, %94;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & 
d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p, %96, %97;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F16E5M2E4M3_SS_TN +{ 
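+  // Usage sketch (illustrative only; the descriptor and fragment names below are
+  // assumptions, not part of this header). These atoms are normally reached
+  // through a cute::TiledMMA / MMA_Atom rather than called directly, roughly:
+  //
+  //   cute::warpgroup_arrive();
+  //   SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E4M3_SS_TN<>::fma(
+  //       desc_a, desc_b, d00, /* ..., */ d45, e, GMMA::ScaleOut::One);
+  //   cute::warpgroup_commit_batch();
+  //   cute::warpgroup_wait<0>();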
+ using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + " %46," + " %47," + " %48, %49," + " p, %51, %52;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t 
& d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + "{%46, %47, %48, %49}," + " %50," + " %51, %52," + " p, %54, %55;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, 
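+      // One reference per accumulator: inline asm cannot bind an array, so all 92
+      // fragments are spelled out and matched 1:1 with "{%0, ..., %91}" below.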
float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %96, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " %94, %95," + " p, %97, %98;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & 
d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %99, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " %97, %98," + " p, %100, %101;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using 
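+  // 50 = 200/4: each of the 128 threads in the warpgroup holds N/2 = 100 f16
+  // accumulator values, packed two per uint32_t.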
CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + " %50," + " %51," + " %52, %53," + " p, %55, %56;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t 
& d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + "{%50, %51, %52, %53}," + " %54," + " %55, %56," + " p, %58, %59;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, 
float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %104, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " %102, %103," + " p, %105, %106;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + 
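+  // RS variant: the A operand arrives as four 32-bit register fragments
+  // (a000..a003) instead of an SMEM descriptor; B is still read via desc_b.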
fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %107, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " %105, %106," + " p, %108, %109;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + 
"+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E4M3_SS_TN 
without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = 
void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p, %109, %110;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), 
"+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, 
%50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + "{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p, %112, %113;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + uint32_t const& e, + 
GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + " %54," + " %55," + " %56, %57," + " p, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, 
%17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + "{%54, %55, %56, %57}," + " %58," + " %59, %60," + " p, %62, %63;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %112, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " %110, %111," + " p, %113, %114;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, 
+ float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %115, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " %113, %114," + " p, %116, %117;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + 
"+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = 
GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p, %117, %118;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), 
"+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p, %120, %121;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & 
d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " %60, %61," + " p, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, 
uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " %63, %64," + " p, %66, %67;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & 
d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %120, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " %118, %119," + " p, %121, %122;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + 
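// Guard: using this atom without CUTE_ARCH_MMA_SM90A_ENABLED is invalid. +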
CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %123, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " %121, %122," + 
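// Trailing asm operands: predicate p (derived from scale_D), then immediates scaleA, scaleB. +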
" p, %124, %125;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + 
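// p = (scale_D != 0): when false the result overwrites D instead of accumulating into it. +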
"wgmma.mma_async.sp.sync.aligned.m64n240k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, 
float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p, %125, %126;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using 
ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p, %128, %129;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), 
"+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F16E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, 
%34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + " %62," + " %63," + " %64, %65," + " p, %67, %68;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F16+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F16E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f16.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " 
%56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " %67, %68," + " p, %70, %71;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F32E5M2E4M3_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t 
const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %128, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " %126, %127," + " p, %129, %130;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E4M3_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F32+=E5M2*E4M3 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F32E5M2E4M3_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& 
a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %131, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f32.e5m2.e4m3 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " %129, %130," + " p, %132, %133;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), 
"+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E4M3_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %10, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5}," + " %6," + " %7," + " %8, %9," + " p, %11, %12;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[6]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3, + uint64_t const& desc_b, + uint32_t & d0, uint32_t & d1, uint32_t & d2, uint32_t & d3, + uint32_t & d4, uint32_t & d5, + 
uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %13, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5}," + "{%6, %7, %8, %9}," + " %10," + " %11, %12," + " p, %14, %15;\n" + "}\n" + : "+r"(d0), "+r"(d1), "+r"(d2), "+r"(d3), + "+r"(d4), "+r"(d5) + : "r"(a0), "r"(a1), "r"(a2), "r"(a3), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n24k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x24x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x24x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + 
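// The .sp qualifier below selects the structured-sparse form of wgmma.mma_async. +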
"wgmma.mma_async.sp.sync.aligned.m64n24k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %14, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + " %10," + " %11," + " %12, %13," + " p, %15, %16;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[10]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %17, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9}," + "{%10, %11, %12, %13}," + " %14," + " %15, %16," + " p, %18, %19;\n" + "}\n" + 
: "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n40k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x40x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x40x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n40k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + "{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %16, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + " %12," + " %13," + " %14, %15," + " p, %17, %18;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[12]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg 
.pred p;\n" + "setp.ne.b32 p, %19, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11}," + "{%12, %13, %14, %15}," + " %16," + " %17, %18," + " p, %20, %21;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %28, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + " %24," + " %25," + " %26, %27," + " p, %29, %30;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x48x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x48x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[24]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & 
d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %31, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n48k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23}," + "{%24, %25, %26, %27}," + " %28," + " %29, %30," + " p, %32, %33;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %18, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + " %14," + " %15," + " %16, %17," + " p, %19, %20;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[14]; + + CUTE_HOST_DEVICE static void 
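+  // Note: the RS ("register-source") variants such as this one read the A
+  // fragment from four 32-bit registers (a00..a03) rather than a shared-memory
+  // descriptor; B is still named by its matrix descriptor (desc_b). The 14
+  // uint32_t accumulators hold each thread's slice of the 64x56 f16 tile (two
+  // f16 values packed per register), and `e` carries the 2:4 sparsity metadata
+  // lane selected by the `spsel` template argument. A minimal invocation
+  // sketch, illustrative only -- in practice these atoms are driven through
+  // CuTe's MMA_Atom machinery, and the operand values here are hypothetical:
+  //
+  //   uint32_t a0, a1, a2, a3;   // A fragment already in registers
+  //   uint64_t desc_b;           // B shared-memory matrix descriptor
+  //   uint32_t d[14] = {};       // packed f16x2 accumulators
+  //   uint32_t e;                // 2:4 sparsity metadata
+  //   GMMA_64x56x64_F16E5M2E5M2_RS_TN<>::fma(
+  //       a0, a1, a2, a3, desc_b,
+  //       d[0], d[1], d[2], d[3], d[4], d[5], d[6], d[7],
+  //       d[8], d[9], d[10], d[11], d[12], d[13], e);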
+ fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %21, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13}," + "{%14, %15, %16, %17}," + " %18," + " %19, %20," + " p, %22, %23;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x56x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE 
GMMA 64x56x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x56x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n56k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %22, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + " %18," + " %19," + " %20, %21," + " p, %23, %24;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + 
"+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[18]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %25, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17}," + "{%18, %19, %20, %21}," + " %22," + " %23, %24," + " p, %26, %27;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut 
const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x72x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x72x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n72k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), 
"n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %24, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + " %20," + " %21," + " %22, %23," + " p, %25, %26;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[20]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %27, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19}," + 
"{%20, %21, %22, %23}," + " %24," + " %25, %26," + " p, %28, %29;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x80x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x80x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters 
= uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n80k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %26, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, 
%17, %18, %19, %20, %21}," + " %22," + " %23," + " %24, %25," + " p, %27, %28;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[22]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %29, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21}," + "{%22, %23, %24, %25}," + " %26," + " %27, %28," + " p, %30, %31;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, 
float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x88x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x88x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n88k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, 
%29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %30, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + " %26," + " %27," + " %28, %29," + " p, %31, %32;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = 
uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[26]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %33, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25}," + "{%26, %27, %28, %29}," + " %30," + " %31, %32," + " p, %34, %35;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, 
%13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x104x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x104x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n104k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), 
"+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %32, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + " %28," + " %29," + " %30, %31," + " p, %33, %34;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[28]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & 
d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %35, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27}," + "{%28, %29, %30, %31}," + " %32," + " %33, %34," + " p, %36, %37;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " 
%48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x112x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x112x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n112k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), 
"+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %34, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + " %30," + " %31," + " %32, %33," + " p, %35, %36;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[30]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & 
d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %37, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29}," + "{%30, %31, %32, %33}," + " %34," + " %35, %36," + " p, %38, %39;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, 
%19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x120x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x120x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n120k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), 
"+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x120x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %38, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + " %34," + " %35," + " %36, %37," + " p, %39, %40;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F16+=E5M2*E5M2 +template < + 
GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[34]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %41, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33}," + "{%34, %35, %36, %37}," + " %38," + " %39, %40," + " p, %42, %43;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + 
float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %72, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + " %68," + " %69," + " %70, %71," + " p, %73, %74;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x136x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x136x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[68]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, 
float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %75, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n136k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67}," + "{%68, %69, %70, %71}," + " %72," + " %73, %74," + " p, %76, %77;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %40, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n144k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + " %36," + " %37," + " %38, %39," + " p, %41, %42;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[36]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %43, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35}," + "{%36, %37, %38, %39}," + " %40," + " %41, %42," + " p, %44, %45;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %76, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + " %72," + " %73," + " %74, %75," + " p, %77, %78;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E5M2_SS_TN without 
CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x144x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x144x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[72]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %79, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n144k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71}," + "{%72, %73, %74, %75}," + " %76," + " %77, %78," + " p, %80, %81;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use 
SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %42, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + " %38," + " %39," + " %40, %41," + " p, %43, %44;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[38]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, 
uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %45, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37}," + "{%38, %39, %40, %41}," + " %42," + " %43, %44," + " p, %46, %47;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + 
cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %80, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + " %76," + " %77," + " %78, %79," + " p, %81, %82;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x152x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x152x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[76]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + 
uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %83, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n152k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75}," + "{%76, %77, %78, %79}," + " %80," + " %81, %82," + " p, %84, %85;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %44, 0;\n" + 
"wgmma.mma_async.sp.sync.aligned.m64n160k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + " %40," + " %41," + " %42, %43," + " p, %45, %46;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[40]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %47, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39}," + "{%40, %41, %42, %43}," + " %44," + " %45, %46," + " p, %48, %49;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + 
"l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %84, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + " %80," + " %81," + " %82, %83," + " p, %85, %86;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), 
"+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x160x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x160x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[80]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %87, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n160k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79}," + "{%80, %81, %82, %83}," + " %84," + " %85, %86," + " p, %88, %89;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), 
"+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %46, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + " %42," + " %43," + " %44, %45," + " p, %47, %48;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + 
+// SPARSE GMMA 64x168x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[42]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %49, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41}," + "{%42, %43, %44, %45}," + " %46," + " %47, %48," + " p, %50, %51;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & 
d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %88, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + " %84," + " %85," + " %86, %87," + " p, %89, %90;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x168x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x168x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[84]; + + CUTE_HOST_DEVICE static void + fma(uint32_t 
const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %91, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n168k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83}," + "{%84, %85, %86, %87}," + " %88," + " %89, %90," + " p, %92, %93;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F16+=E5M2*E5M2 
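+// (Fragment sizes for this shape: the f16 accumulators use
+//  CRegisters = uint32_t[44]; the f32 variants below use float[88].)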
+template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %48, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + " %44," + " %45," + " %46, %47," + " p, %49, %50;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[44]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & 
d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %51, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43}," + "{%44, %45, %46, %47}," + " %48," + " %49, %50," + " p, %52, %53;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float 
& d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %92, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + " %88," + " %89," + " %90, %91," + " p, %93, %94;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x176x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x176x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[88]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float 
& d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %95, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n176k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87}," + "{%88, %89, %90, %91}," + " %92," + " %93, %94," + " p, %96, %97;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, 
uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %50, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + " %46," + " %47," + " %48, %49," + " p, %51, %52;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[46]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, 
uint32_t & d45, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %53, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45}," + "{%46, %47, %48, %49}," + " %50," + " %51, %52," + " p, %54, %55;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, 
desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %96, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + " %92," + " %93," + " %94, %95," + " p, %97, %98;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x184x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x184x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[92]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + float & d00, float & d01, float & d02, float & d03, + float & d04, float & d05, float & d06, float & d07, + float & d08, float & d09, float & d10, float & d11, + float & d12, float & d13, float & d14, float & d15, + float & d16, float & d17, float & d18, float & d19, + float & d20, float & d21, float & d22, float & d23, + float & d24, float & d25, float & d26, float & d27, + float & d28, float & d29, float & d30, float & d31, + float & d32, float & d33, float & d34, float & d35, + float & d36, float & d37, float & d38, float & d39, + float & d40, float & d41, float & d42, float & d43, + float & d44, float & d45, float & d46, float & d47, + float & d48, float & d49, float & d50, float & d51, + float & d52, float & d53, float & d54, float & d55, + float & d56, float & d57, float & d58, float & d59, + 
float & d60, float & d61, float & d62, float & d63, + float & d64, float & d65, float & d66, float & d67, + float & d68, float & d69, float & d70, float & d71, + float & d72, float & d73, float & d74, float & d75, + float & d76, float & d77, float & d78, float & d79, + float & d80, float & d81, float & d82, float & d83, + float & d84, float & d85, float & d86, float & d87, + float & d88, float & d89, float & d90, float & d91, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %99, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n184k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91}," + "{%92, %93, %94, %95}," + " %96," + " %97, %98," + " p, %100, %101;\n" + "}\n" + : "+f"(d00), "+f"(d01), "+f"(d02), "+f"(d03), + "+f"(d04), "+f"(d05), "+f"(d06), "+f"(d07), + "+f"(d08), "+f"(d09), "+f"(d10), "+f"(d11), + "+f"(d12), "+f"(d13), "+f"(d14), "+f"(d15), + "+f"(d16), "+f"(d17), "+f"(d18), "+f"(d19), + "+f"(d20), "+f"(d21), "+f"(d22), "+f"(d23), + "+f"(d24), "+f"(d25), "+f"(d26), "+f"(d27), + "+f"(d28), "+f"(d29), "+f"(d30), "+f"(d31), + "+f"(d32), "+f"(d33), "+f"(d34), "+f"(d35), + "+f"(d36), "+f"(d37), "+f"(d38), "+f"(d39), + "+f"(d40), "+f"(d41), "+f"(d42), "+f"(d43), + "+f"(d44), "+f"(d45), "+f"(d46), "+f"(d47), + "+f"(d48), "+f"(d49), "+f"(d50), "+f"(d51), + "+f"(d52), "+f"(d53), "+f"(d54), "+f"(d55), + "+f"(d56), "+f"(d57), "+f"(d58), "+f"(d59), + "+f"(d60), "+f"(d61), "+f"(d62), "+f"(d63), + "+f"(d64), "+f"(d65), "+f"(d66), "+f"(d67), + "+f"(d68), "+f"(d69), "+f"(d70), "+f"(d71), + "+f"(d72), "+f"(d73), "+f"(d74), "+f"(d75), + "+f"(d76), "+f"(d77), "+f"(d78), "+f"(d79), + "+f"(d80), "+f"(d81), "+f"(d82), "+f"(d83), + "+f"(d84), "+f"(d85), "+f"(d86), "+f"(d87), + "+f"(d88), "+f"(d89), "+f"(d90), "+f"(d91) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, 
uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %54, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + " %50," + " %51," + " %52, %53," + " p, %55, %56;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[50]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & 
d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %57, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49}," + "{%50, %51, %52, %53}," + " %54," + " %55, %56," + " p, %58, %59;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & 
d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %104, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + " %100," + " %101," + " %102, %103," + " p, %105, %106;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x200x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x200x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[100]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & 
d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %107, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n200k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99}," + "{%100, %101, %102, %103}," + " %104," + " %105, %106," + " p, %108, %109;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), 
"n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %56, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + " %52," + " %53," + " %54, %55," + " p, %57, %58;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + 
GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[52]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %59, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51}," + "{%52, %53, %54, %55}," + " %56," + " %57, %58," + " p, %60, %61;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float 
& d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %108, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + " %104," + " %105," + " %106, %107," + " p, %109, %110;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), 
"+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x208x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x208x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[104]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %111, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n208k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103}," + 
"{%104, %105, %106, %107}," + " %108," + " %109, %110," + " p, %112, %113;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %58, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, 
%4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + " %54," + " %55," + " %56, %57," + " p, %59, %60;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[54]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %61, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53}," + "{%54, %55, %56, %57}," + " %58," + " %59, %60," + " p, %62, %63;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), 
"+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %112, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, 
%28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + " %108," + " %109," + " %110, %111," + " p, %113, %114;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x216x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x216x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[108]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & 
d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %115, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n216k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107}," + "{%108, %109, %110, %111}," + " %112," + " %113, %114," + " p, %116, %117;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); 
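+    // Note: CUTE_ARCH_MMA_SM90A_ENABLED requires compiling for the sm_90a
+    // target (e.g. -arch=sm_90a); on any other target a call to fma lands
+    // in this trap rather than emitting wgmma.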
+#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %60, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + " %56," + " %57," + " %58, %59," + " p, %61, %62;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = 
void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[56]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %63, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55}," + "{%56, %57, %58, %59}," + " %60," + " %61, %62," + " p, %64, %65;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + 
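+    // A minimal direct-use sketch (in practice these atoms are driven through
+    // cute::MMA_Atom and the collective mainloop rather than called by hand):
+    //   uint64_t desc_a = /* GMMA smem descriptor for A */;
+    //   uint64_t desc_b = /* GMMA smem descriptor for B */;
+    //   uint32_t e      = /* 2:4 sparsity metadata word  */;
+    //   float d[112] = {};  // per-thread accumulator fragment
+    //   GMMA_64x224x64_F32E5M2E5M2_SS_TN<>::fma(desc_a, desc_b,
+    //       d[0], d[1], /* ..., */ d[111], e, GMMA::ScaleOut::One);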
float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %116, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + " %112," + " %113," + " %114, %115," + " p, %117, %118;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), 
"+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x224x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x224x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[112]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %119, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n224k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, 
%34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111}," + "{%112, %113, %114, %115}," + " %116," + " %117, %118," + " p, %120, %121;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, 
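+        // Note: F16 accumulators are packed two halves per uint32_t, so this
+        // m64n232 tile needs 64*232/128 = 116 halves per thread, i.e. the 58
+        // registers declared as CRegisters = uint32_t[58].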
uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %62, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + " %58," + " %59," + " %60, %61," + " p, %63, %64;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[58]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + 
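+        // Note: the _RS_ variants source the A fragment from registers
+        // (ARegisters = uint32_t[4]) with B behind a shared-memory
+        // descriptor; the _SS_ variants describe both operands in shared
+        // memory via the 64-bit desc_a/desc_b GMMA descriptors.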
uint32_t & d56, uint32_t & d57, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %65, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57}," + "{%58, %59, %60, %61}," + " %62," + " %63, %64," + " p, %66, %67;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, 
float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %120, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + " %116," + " %117," + " %118, %119," + " p, %121, %122;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x232x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + 
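+// Note: the cutlass::arch::synclog_emit_wgmma_* calls in these fma bodies are
+// hooks for CUTLASS's synclog debugging facility and are intended to compile
+// away when synclog is disabled, keeping the wrappers zero-overhead.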
GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x232x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[116]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %123, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n232k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115}," + "{%116, %117, %118, %119}," + " %120," + " %121, %122," + " p, %124, %125;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), 
"+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %64, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, 
%42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + " %60," + " %61," + " %62, %63," + " p, %65, %66;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[60]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %67, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59}," + "{%60, %61, %62, %63}," + " %64," + " %65, %66," + " p, %68, %69;\n" + "}\n" + : "+r"(d00), 
"+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" 
+ ".reg .pred p;\n" + "setp.ne.b32 p, %124, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + " %120," + " %121," + " %122, %123," + " p, %125, %126;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x240x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x240x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[120]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & 
d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %127, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n240k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119}," + "{%120, %121, %122, %123}," + " %124," + " %125, %126," + " p, %128, %129;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), 
"+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F16E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %66, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + " %62," + " %63," + " %64, %65," + " p, %67, %68;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + 
"+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), "+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F16+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F16E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = uint32_t[62]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a00, uint32_t const& a01, uint32_t const& a02, uint32_t const& a03, + uint64_t const& desc_b, + uint32_t & d00, uint32_t & d01, uint32_t & d02, uint32_t & d03, + uint32_t & d04, uint32_t & d05, uint32_t & d06, uint32_t & d07, + uint32_t & d08, uint32_t & d09, uint32_t & d10, uint32_t & d11, + uint32_t & d12, uint32_t & d13, uint32_t & d14, uint32_t & d15, + uint32_t & d16, uint32_t & d17, uint32_t & d18, uint32_t & d19, + uint32_t & d20, uint32_t & d21, uint32_t & d22, uint32_t & d23, + uint32_t & d24, uint32_t & d25, uint32_t & d26, uint32_t & d27, + uint32_t & d28, uint32_t & d29, uint32_t & d30, uint32_t & d31, + uint32_t & d32, uint32_t & d33, uint32_t & d34, uint32_t & d35, + uint32_t & d36, uint32_t & d37, uint32_t & d38, uint32_t & d39, + uint32_t & d40, uint32_t & d41, uint32_t & d42, uint32_t & d43, + uint32_t & d44, uint32_t & d45, uint32_t & d46, uint32_t & d47, + uint32_t & d48, uint32_t & d49, uint32_t & d50, uint32_t & d51, + uint32_t & d52, uint32_t & d53, uint32_t & d54, uint32_t & d55, + uint32_t & d56, uint32_t & d57, uint32_t & d58, uint32_t & d59, + uint32_t & d60, uint32_t & d61, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %69, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f16.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61}," + "{%62, %63, %64, %65}," + " %66," + " %67, %68," + " p, %70, %71;\n" + "}\n" + : "+r"(d00), "+r"(d01), "+r"(d02), "+r"(d03), + "+r"(d04), "+r"(d05), "+r"(d06), "+r"(d07), + "+r"(d08), "+r"(d09), "+r"(d10), "+r"(d11), + "+r"(d12), "+r"(d13), "+r"(d14), "+r"(d15), + "+r"(d16), "+r"(d17), "+r"(d18), "+r"(d19), + "+r"(d20), "+r"(d21), "+r"(d22), "+r"(d23), + "+r"(d24), "+r"(d25), "+r"(d26), "+r"(d27), + "+r"(d28), "+r"(d29), "+r"(d30), "+r"(d31), + "+r"(d32), "+r"(d33), 
"+r"(d34), "+r"(d35), + "+r"(d36), "+r"(d37), "+r"(d38), "+r"(d39), + "+r"(d40), "+r"(d41), "+r"(d42), "+r"(d43), + "+r"(d44), "+r"(d45), "+r"(d46), "+r"(d47), + "+r"(d48), "+r"(d49), "+r"(d50), "+r"(d51), + "+r"(d52), "+r"(d53), "+r"(d54), "+r"(d55), + "+r"(d56), "+r"(d57), "+r"(d58), "+r"(d59), + "+r"(d60), "+r"(d61) + : "r"(a00), "r"(a01), "r"(a02), "r"(a03), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F32E5M2E5M2_SS_TN +{ + using DRegisters = void; + using ARegisters = uint64_t[1]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint64_t const& desc_a, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_smem_smem(__LINE__, desc_a, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %128, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, 
%33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + " %124," + " %125," + " %126, %127," + " p, %129, %130;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), "+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "l"(desc_a), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E5M2_SS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +// SPARSE GMMA 64x248x64 TN F32+=E5M2*E5M2 +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One, + GMMA::SparseSel spsel = GMMA::SparseSel::Zero +> +struct GMMA_64x248x64_F32E5M2E5M2_RS_TN +{ + using DRegisters = void; + using ARegisters = uint32_t[4]; + using ERegisters = uint32_t[1]; + using BRegisters = uint64_t[1]; + using CRegisters = float[124]; + + CUTE_HOST_DEVICE static void + fma(uint32_t const& a000, uint32_t const& a001, uint32_t const& a002, uint32_t const& a003, + uint64_t const& desc_b, + float & d000, float & d001, float & d002, float & d003, + float & d004, float & d005, float & d006, float & d007, + float & d008, float & d009, float & d010, float & d011, + float & d012, float & d013, float & d014, float & d015, + float & d016, float & d017, float & d018, float & d019, + float & d020, float & d021, float & d022, float & d023, + float & d024, float & d025, float & d026, float & d027, + float & d028, float & 
d029, float & d030, float & d031, + float & d032, float & d033, float & d034, float & d035, + float & d036, float & d037, float & d038, float & d039, + float & d040, float & d041, float & d042, float & d043, + float & d044, float & d045, float & d046, float & d047, + float & d048, float & d049, float & d050, float & d051, + float & d052, float & d053, float & d054, float & d055, + float & d056, float & d057, float & d058, float & d059, + float & d060, float & d061, float & d062, float & d063, + float & d064, float & d065, float & d066, float & d067, + float & d068, float & d069, float & d070, float & d071, + float & d072, float & d073, float & d074, float & d075, + float & d076, float & d077, float & d078, float & d079, + float & d080, float & d081, float & d082, float & d083, + float & d084, float & d085, float & d086, float & d087, + float & d088, float & d089, float & d090, float & d091, + float & d092, float & d093, float & d094, float & d095, + float & d096, float & d097, float & d098, float & d099, + float & d100, float & d101, float & d102, float & d103, + float & d104, float & d105, float & d106, float & d107, + float & d108, float & d109, float & d110, float & d111, + float & d112, float & d113, float & d114, float & d115, + float & d116, float & d117, float & d118, float & d119, + float & d120, float & d121, float & d122, float & d123, + uint32_t const& e, + GMMA::ScaleOut const scale_D = GMMA::ScaleOut::One) + { +#if defined(CUTE_ARCH_MMA_SM90A_ENABLED) + cutlass::arch::synclog_emit_wgmma_reg_smem(__LINE__, desc_b); + asm volatile( + "{\n" + ".reg .pred p;\n" + "setp.ne.b32 p, %131, 0;\n" + "wgmma.mma_async.sp.sync.aligned.m64n248k64.f32.e5m2.e5m2 " + "{%0, %1, %2, %3, %4, %5, %6, %7, " + " %8, %9, %10, %11, %12, %13, %14, %15, " + " %16, %17, %18, %19, %20, %21, %22, %23, " + " %24, %25, %26, %27, %28, %29, %30, %31, " + " %32, %33, %34, %35, %36, %37, %38, %39, " + " %40, %41, %42, %43, %44, %45, %46, %47, " + " %48, %49, %50, %51, %52, %53, %54, %55, " + " %56, %57, %58, %59, %60, %61, %62, %63, " + " %64, %65, %66, %67, %68, %69, %70, %71, " + " %72, %73, %74, %75, %76, %77, %78, %79, " + " %80, %81, %82, %83, %84, %85, %86, %87, " + " %88, %89, %90, %91, %92, %93, %94, %95, " + " %96, %97, %98, %99, %100, %101, %102, %103, " + " %104, %105, %106, %107, %108, %109, %110, %111, " + " %112, %113, %114, %115, %116, %117, %118, %119, " + " %120, %121, %122, %123}," + "{%124, %125, %126, %127}," + " %128," + " %129, %130," + " p, %132, %133;\n" + "}\n" + : "+f"(d000), "+f"(d001), "+f"(d002), "+f"(d003), + "+f"(d004), "+f"(d005), "+f"(d006), "+f"(d007), + "+f"(d008), "+f"(d009), "+f"(d010), "+f"(d011), + "+f"(d012), "+f"(d013), "+f"(d014), "+f"(d015), + "+f"(d016), "+f"(d017), "+f"(d018), "+f"(d019), + "+f"(d020), "+f"(d021), "+f"(d022), "+f"(d023), + "+f"(d024), "+f"(d025), "+f"(d026), "+f"(d027), + "+f"(d028), "+f"(d029), "+f"(d030), "+f"(d031), + "+f"(d032), "+f"(d033), "+f"(d034), "+f"(d035), + "+f"(d036), "+f"(d037), "+f"(d038), "+f"(d039), + "+f"(d040), "+f"(d041), "+f"(d042), "+f"(d043), + "+f"(d044), "+f"(d045), "+f"(d046), "+f"(d047), + "+f"(d048), "+f"(d049), "+f"(d050), "+f"(d051), + "+f"(d052), "+f"(d053), "+f"(d054), "+f"(d055), + "+f"(d056), "+f"(d057), "+f"(d058), "+f"(d059), + "+f"(d060), "+f"(d061), "+f"(d062), "+f"(d063), + "+f"(d064), "+f"(d065), "+f"(d066), "+f"(d067), + "+f"(d068), "+f"(d069), "+f"(d070), "+f"(d071), + "+f"(d072), "+f"(d073), "+f"(d074), "+f"(d075), + "+f"(d076), "+f"(d077), "+f"(d078), "+f"(d079), + "+f"(d080), "+f"(d081), "+f"(d082), 
"+f"(d083), + "+f"(d084), "+f"(d085), "+f"(d086), "+f"(d087), + "+f"(d088), "+f"(d089), "+f"(d090), "+f"(d091), + "+f"(d092), "+f"(d093), "+f"(d094), "+f"(d095), + "+f"(d096), "+f"(d097), "+f"(d098), "+f"(d099), + "+f"(d100), "+f"(d101), "+f"(d102), "+f"(d103), + "+f"(d104), "+f"(d105), "+f"(d106), "+f"(d107), + "+f"(d108), "+f"(d109), "+f"(d110), "+f"(d111), + "+f"(d112), "+f"(d113), "+f"(d114), "+f"(d115), + "+f"(d116), "+f"(d117), "+f"(d118), "+f"(d119), + "+f"(d120), "+f"(d121), "+f"(d122), "+f"(d123) + : "r"(a000), "r"(a001), "r"(a002), "r"(a003), + "l"(desc_b), + "r"(e), "n"(int32_t(spsel)), + "r"(int32_t(scale_D)), "n"(int32_t(scaleA)), "n"(int32_t(scaleB))); +#else + CUTE_INVALID_CONTROL_PATH("Attempting to use SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E5M2_RS_TN without CUTE_ARCH_MMA_SM90A_ENABLED"); +#endif + } +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +} // namespace SM90::GMMA::SPARSE + +} // namespace cute diff --git a/include/cute/arch/util.hpp b/include/cute/arch/util.hpp index 124a2560b6..9290439b9f 100644 --- a/include/cute/arch/util.hpp +++ b/include/cute/arch/util.hpp @@ -31,7 +31,6 @@ #pragma once #include - #include #if defined(__clang__) && defined(__CUDA__) @@ -258,40 +257,29 @@ explode(Fn fn, return fn(d[Id]..., a[Ia]..., b[Ib]..., c[Ic]..., e[Ie]..., f[If]...); } -#if defined(CUTLASS_ENABLE_SYCL) -template -CUTE_HOST_DEVICE constexpr -void -explode_mma(PtrA&& a, int_sequence) -{ - return MMA_Op::fma(a[I]...); -} - -template -CUTE_HOST_DEVICE constexpr -void -explode_mma(PtrS&& s, int_sequence, - PtrD&& d, int_sequence) -{ - return MMA_Op::fma(s[Is]..., d[Id]...); -} - -template + class PtrC, int... Ic, + class PtrE, int... Ie, + class PtrF, int... If, + class PtrG, int... 
diff --git a/include/cute/arch/util.hpp b/include/cute/arch/util.hpp
index 124a2560b6..9290439b9f 100644
--- a/include/cute/arch/util.hpp
+++ b/include/cute/arch/util.hpp
@@ -31,7 +31,6 @@
 #pragma once
 
 #include
-
 #include
 
 #if defined(__clang__) && defined(__CUDA__)
@@ -258,40 +257,29 @@ explode(Fn fn,
   return fn(d[Id]..., a[Ia]..., b[Ib]..., c[Ic]..., e[Ie]..., f[If]...);
 }
 
-#if defined(CUTLASS_ENABLE_SYCL)
-template <class MMA_Op, class PtrA, int... I>
-CUTE_HOST_DEVICE constexpr
-void
-explode_mma(PtrA&& a, int_sequence<I...>)
-{
-  return MMA_Op::fma(a[I]...);
-}
-
-template <class MMA_Op, class PtrS, int... Is, class PtrD, int... Id>
-CUTE_HOST_DEVICE constexpr
-void
-explode_mma(PtrS&& s, int_sequence<Is...>,
-            PtrD&& d, int_sequence<Id...>)
-{
-  return MMA_Op::fma(s[Is]..., d[Id]...);
-}
-
-template <class MMA_Op, class PtrA, int... Ia, class PtrB, int... Ib, class PtrC, int... Ic>
+template <class Fn,
+          class PtrD, int... Id,
+          class PtrA, int... Ia,
+          class PtrB, int... Ib,
+          class PtrC, int... Ic,
+          class PtrE, int... Ie,
+          class PtrF, int... If,
+          class PtrG, int... Ig>
 CUTE_HOST_DEVICE constexpr
 void
-explode_mma(PtrA&& a, int_sequence<Ia...>,
-            PtrB&& b, int_sequence<Ib...>,
-            PtrC&& c, int_sequence<Ic...>)
+explode(Fn fn,
+        PtrD&& d, int_sequence<Id...>,
+        PtrA&& a, int_sequence<Ia...>,
+        PtrB&& b, int_sequence<Ib...>,
+        PtrC&& c, int_sequence<Ic...>,
+        PtrE&& e, int_sequence<Ie...>,
+        PtrF&& f, int_sequence<If...>,
+        PtrG&& g, int_sequence<Ig...>)
 {
-  return MMA_Op::fma(a[Ia]..., b[Ib]..., c[Ic]...);
+  return fn(d[Id]..., a[Ia]..., b[Ib]..., c[Ic]..., e[Ie]..., f[If]..., g[Ig]...);
 }
 
+#if defined(CUTLASS_ENABLE_SYCL)
 template <class MMA_Op, class PtrD, int... Id, class PtrA, int... Ia, class PtrB, int... Ib, class PtrC, int... Ic>
 CUTE_HOST_DEVICE constexpr
 void
 explode_mma(PtrD&& d, int_sequence<Id...>,
-             PtrA&& a, int_sequence<Ia...>,
-             PtrB&& b, int_sequence<Ib...>,
-             PtrC&& c, int_sequence<Ic...>)
+            PtrA&& a, int_sequence<Ia...>,
+            PtrB&& b, int_sequence<Ib...>,
+            PtrC&& c, int_sequence<Ic...>)
 {
   return MMA_Op::fma(d[Id]..., a[Ia]..., b[Ib]..., c[Ic]...);
 }
-
-template <class MMA_Op, class PtrD, int... Id, class PtrA, int... Ia, class PtrB, int... Ib, class PtrC, int... Ic, class PtrE, int... Ie>
-CUTE_HOST_DEVICE constexpr
-void
-explode_mma(PtrD&& d, int_sequence<Id...>,
-            PtrA&& a, int_sequence<Ia...>,
-            PtrB&& b, int_sequence<Ib...>,
-            PtrC&& c, int_sequence<Ic...>,
-            PtrE&& e, int_sequence<Ie...>)
-{
-  return MMA_Op::fma(d[Id]..., a[Ia]..., b[Ib]..., c[Ic]..., e[Ie]...);
-}
-
-template <class MMA_Op, class PtrD, int... Id, class PtrA, int... Ia, class PtrB, int... Ib, class PtrC, int... Ic, class PtrSFA, int... Isfa, class PtrSFB, int... Isfb>
-CUTE_HOST_DEVICE constexpr
-void
-explode_mma(PtrD&& d, int_sequence<Id...>,
-            PtrA&& a, int_sequence<Ia...>,
-            PtrB&& b, int_sequence<Ib...>,
-            PtrC&& c, int_sequence<Ic...>,
-            PtrSFA&& sfa, int_sequence<Isfa...>,
-            PtrSFB&& sfb, int_sequence<Isfb...>)
-{
-  return MMA_Op::fma(d[Id]..., a[Ia]..., b[Ib]..., c[Ic]..., sfa[Isfa]..., sfb[Isfb]...);
-}
 #endif
 
 //
diff --git a/include/cute/atom/copy_atom.hpp b/include/cute/atom/copy_atom.hpp
index 2546254771..0da2dd4141 100644
--- a/include/cute/atom/copy_atom.hpp
+++ b/include/cute/atom/copy_atom.hpp
@@ -30,16 +30,13 @@
 **************************************************************************************************/
 #pragma once
 
-#include
-
-#include
-
-#include
-#include
-
-#include
-
-#include
+#include   // CUTE_HOST_DEVICE
+#include   // cute::Tensor
+#include   // cute::__CUTE_REQUIRES
+#include   // cute::is_tuple
+#include   // cute::is_constant, cute::is_integral
+#include   // cute::Copy_Traits
+#include   // cute::TiledMMA
 
 namespace cute
 {
@@ -651,10 +648,12 @@ print(ThrCopy const& thr_copy)
   print(TiledCopy{});
 }
 
-template <class... Args>
+// TiledCopy to LaTeX TikZ
+template <class... Args, class TikzColorFn = TikzColor_TV>
 CUTE_HOST_DEVICE
 auto
-print_latex(TiledCopy<Args...> const& copy)
+print_latex(TiledCopy<Args...> const& copy,
+            TikzColorFn color = {}) // lambda(thr_idx,val_idx) -> tikz color string
 {
   auto [layoutS_MN, thrID_S] = copy.get_layoutS_MN();
   auto [layoutD_MN, thrID_D] = copy.get_layoutD_MN();
@@ -663,13 +662,15 @@ print_latex(TiledCopy const& copy)
                     layoutD_MN, thrID_D);
 }
 
-// MNK Copy Layout to Latex TIKZ -- 8-value color coded by thread
+// MNK Copy Layout to LaTeX TikZ
 template <class LayoutS, class ThrIDS,
-          class LayoutD, class ThrIDD>
+          class LayoutD, class ThrIDD,
+          class TikzColorFn = TikzColor_TV>
 CUTE_HOST_DEVICE
 void
 print_latex_copy(LayoutS const& S, ThrIDS const& TS, // (m,n) -> (tid,vid) and tid -> thr_idx
-                 LayoutD const& D, ThrIDD const& TD) // (m,n) -> (tid,vid) and tid -> thr_idx
+                 LayoutD const& D, ThrIDD const& TD, // (m,n) -> (tid,vid) and tid -> thr_idx
+                 TikzColorFn color = {})             // lambda(thr_idx,val_idx) -> tikz color string
 {
   CUTE_STATIC_ASSERT_V(rank(S) == Int<2>{});
   CUTE_STATIC_ASSERT_V(rank(D) == Int<2>{});
@@ -677,33 +678,17 @@ print_latex_copy(LayoutS const& S, ThrIDS const& TS, // (m,n) -> (tid,vid) and
   assert(size<0>(S) == size<0>(D));
   assert(size<1>(S) == size<1>(D));
 
-  char const* latex_header =
-    "\\documentclass{standalone}\n"
-    "\\usepackage{tikz}\n"
-    "\\usetikzlibrary{external}\n"
-    "\\tikzexternalize\n"
-    "\\begin{document}\n"
-    "\\begin{tikzpicture}[x={(0cm,-1cm)},y={(1cm,0cm)},box/.style={rectangle,draw=black,thick,minimum size=1cm,anchor=center}]\n\n";
-  char const* latex_footer =
-
"\\end{tikzpicture}\n" - "\\end{document}\n"; - - char const* color_map[8] = {"{rgb,255:red,175;green,175;blue,255}", - "{rgb,255:red,175;green,255;blue,175}", - "{rgb,255:red,255;green,255;blue,175}", - "{rgb,255:red,255;green,175;blue,175}", - "{rgb,255:red,210;green,210;blue,255}", - "{rgb,255:red,210;green,255;blue,210}", - "{rgb,255:red,255;green,255;blue,210}", - "{rgb,255:red,255;green,210;blue,210}",}; - - // Header + // Commented prints printf("%% LayoutS: "); print(S); printf("\n"); printf("%% ThrIDS : "); print(TS); printf("\n"); printf("%% LayoutD: "); print(D); printf("\n"); printf("%% ThrIDD : "); print(TD); printf("\n\n"); - printf(latex_header); + // Header + printf("\\documentclass[convert]{standalone}\n" + "\\usepackage{tikz}\n\n" + "\\begin{document}\n" + "\\begin{tikzpicture}[x={(0cm,-1cm)},y={(1cm,0cm)},every node/.style={minimum size=1cm, outer sep=0pt}]\n\n"); // S starting at 0,0 for (int i = 0; i < size<0>(S); ++i) { @@ -712,12 +697,22 @@ print_latex_copy(LayoutS const& S, ThrIDS const& TS, // (m,n) -> (tid,vid) and int val_idx = S(i,j) / size(TS); int thr_idx = TS(thrid); - printf("\\node[box,fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", - color_map[thr_idx % 8], + printf("\\node[fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", + color(thr_idx, val_idx), i, j, thr_idx, val_idx); } } + // Grid + printf("\\draw[color=black,thick,shift={(-0.5,-0.5)}] (%d,%d) grid (%d,%d);\n\n", + 0, 0, int(size<0>(S)), int(size<1>(S))); + // S Labels + for (int i = 0, j = -1; i < size<0>(S); ++i) { + printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", i, j, i); + } + for (int i = -1, j = 0; j < size<1>(S); ++j) { + printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", i, j, j); + } // D starting at 0,size<1>(S)+3 for (int i = 0; i < size<0>(D); ++i) { @@ -726,30 +721,26 @@ print_latex_copy(LayoutS const& S, ThrIDS const& TS, // (m,n) -> (tid,vid) and int val_idx = D(i,j) / size(TD); int thr_idx = TD(thrid); - printf("\\node[box,fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", - color_map[thr_idx % 8], + printf("\\node[fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", + color(thr_idx, val_idx), i, j + size<1>(S) + 3, thr_idx, val_idx); } } - - // S Labels - for (int i = 0, j = -1; i < size<0>(S); ++i) { - printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", i, j, i); - } - for (int j = 0, i = -1; j < size<1>(S); ++j) { - printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", i, j, j); - } + // Grid + printf("\\draw[color=black,thick,shift={(-0.5,-0.5)}] (%d,%d) grid (%d,%d);\n\n", + 0, int(size<1>(S)+3), int(size<0>(D)), int(size<1>(D)+size<1>(S)+3)); // D Labels - for (int i = 0, j = size<1>(D); i < size<0>(S); ++i) { + for (int i = 0, j = size<1>(D); i < size<0>(D); ++i) { printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", i, j + size<1>(S) + 3, i); } - for (int j = 0, i = -1; j < size<1>(D); ++j) { + for (int i = -1, j = 0; j < size<1>(D); ++j) { printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", i, j + size<1>(S) + 3, j); } // Footer - printf(latex_footer); + printf("\\end{tikzpicture}\n" + "\\end{document}\n"); } } // end namespace cute diff --git a/include/cute/atom/copy_traits_sm50.hpp b/include/cute/atom/copy_traits_sm50.hpp index 8be0ef7bba..7a693805e6 100644 --- a/include/cute/atom/copy_traits_sm50.hpp +++ b/include/cute/atom/copy_traits_sm50.hpp @@ -39,7 +39,7 @@ namespace cute { template <> -struct Copy_Traits +struct Copy_Traits { // Logical thread id to thread idx (one-thread) using ThrID = Layout<_32>; @@ -55,4 +55,21 @@ struct 
Copy_Traits using RefLayout = SrcLayout; }; +template <> +struct Copy_Traits +{ + // Logical thread id to thread idx (one-thread) + using ThrID = Layout<_32>; + + // Map from (src-thr,src-val) to bit + using SrcLayout = Layout, + Stride<_64, _1>>; + // Map from (dst-thr,dst-val) to bit + using DstLayout = Layout, Shape<_32, _2>>, + Stride,Stride< _1, _256>>>; + + // Reference map from (thr,val) to bit + using RefLayout = SrcLayout; +}; + } // end namespace cute diff --git a/include/cute/atom/copy_traits_sm90_im2col.hpp b/include/cute/atom/copy_traits_sm90_im2col.hpp index f6c9e258eb..54f76073b1 100644 --- a/include/cute/atom/copy_traits_sm90_im2col.hpp +++ b/include/cute/atom/copy_traits_sm90_im2col.hpp @@ -40,6 +40,8 @@ #include "cute/algorithm/prefetch.hpp" #include "cutlass/fast_math.h" +#include "cutlass/cuda_host_adapter.hpp" + namespace cute { @@ -448,9 +450,11 @@ make_im2col_tma_copy_desc( CUtensorMapInterleave tma_interleave = CU_TENSOR_MAP_INTERLEAVE_NONE; CUtensorMapL2promotion tma_l2Promotion = to_CUtensorMapL2promotion(aux_params.l2promo_); CUtensorMapFloatOOBfill tma_oob_fill = to_CUtensorMapFloatOOBfill(aux_params.oobfill_); - CUtensorMapSwizzle tma_swizzle = TMA::to_CUtensorMapSwizzle(detail::get_tma_swizzle_bits(smem_swizzle)); + TMA::SmemSwizzleBits swizzle_bits = detail::get_tma_swizzle_bits(smem_swizzle); + TMA::SmemSwizzleBase swizzle_base = detail::get_tma_swizzle_base(smem_swizzle); + CUtensorMapSwizzle tma_swizzle = TMA::to_CUtensorMapSwizzle(swizzle_bits, swizzle_base); - CUresult encode_result = cuTensorMapEncodeIm2col( + CUresult encode_result = CUTLASS_CUDA_DRIVER_WRAPPER_CALL(cuTensorMapEncodeIm2col)( &tma_desc, tma_format, num_total_modes, @@ -634,11 +638,11 @@ make_tma_atom_im2col(CopyOp, auto range_c = size<0,0>(tma_layout_vt); auto range_whdn = size<0,1>(tma_layout_vt); - Tensor gtensor_cwhdn = make_tensor(gtensor.data(), - flatten(make_layout(basis_get(stride<0,0>(tma_layout_vt), gtensor.layout()), - basis_get(stride<0,1>(tma_layout_vt), gtensor.layout())))); - + flatten(make_layout(make_layout(basis_get(stride<0,0>(tma_layout_vt), gtensor.shape()), + basis_get(stride<0,0>(tma_layout_vt), gtensor.stride())), + make_layout(basis_get(stride<0,1>(tma_layout_vt), gtensor.shape()), + basis_get(stride<0,1>(tma_layout_vt), gtensor.stride()))))); auto [tma_desc, tma_tensor] = make_im2col_tma_copy_desc( gtensor_cwhdn, range_c, diff --git a/include/cute/atom/copy_traits_sm90_tma.hpp b/include/cute/atom/copy_traits_sm90_tma.hpp index e86d035e9d..de0e0424df 100644 --- a/include/cute/atom/copy_traits_sm90_tma.hpp +++ b/include/cute/atom/copy_traits_sm90_tma.hpp @@ -42,6 +42,8 @@ #include +#include + namespace cute { @@ -240,15 +242,22 @@ struct Copy_Traits // Construct an executable SM90_TMA_LOAD_MULTICAST with tma_mbar CUTE_HOST_DEVICE constexpr Copy_Traits - with(uint64_t& tma_load_mbar, uint16_t const& multicast_mask) const { - return {{}, {&tma_desc_, &tma_load_mbar, multicast_mask}}; + with( + uint64_t& tma_load_mbar, + uint16_t const& multicast_mask, + TMA::CacheHintSm90 const& cache_hint = TMA::CacheHintSm90::EVICT_NORMAL) const { + return {{}, {&tma_desc_, &tma_load_mbar, multicast_mask, static_cast(cache_hint)}}; } // Construct an executable SM90_TMA_LOAD_MULTICAST_OP with tma_mbar (temp. 
overloaded for grouped gemm/ptr array gemm) CUTE_HOST_DEVICE constexpr Copy_Traits - with(TmaDescriptor const* new_tma_desc, uint64_t& tma_load_mbar, uint16_t const& multicast_mask) const { - return {{}, {new_tma_desc, &tma_load_mbar, multicast_mask}}; + with( + TmaDescriptor const* new_tma_desc, + uint64_t& tma_load_mbar, + uint16_t const& multicast_mask, + TMA::CacheHintSm90 const& cache_hint = TMA::CacheHintSm90::EVICT_NORMAL) const { + return {{}, {new_tma_desc, &tma_load_mbar, multicast_mask, static_cast(cache_hint)}}; } // Generate the TMA coord tensor @@ -286,7 +295,8 @@ struct Copy_Traits tuple< TmaDescriptor const*, uint64_t*, // smem mbarrier - uint16_t // multicast mask + uint16_t, // multicast mask + uint64_t // cache hint > const opargs_; }; @@ -683,8 +693,10 @@ construct_tma_gbasis(Tensor const& gtensor, // The origin // TMA parameter checking // - CUTE_STATIC_ASSERT_V(product_each(shape(slayout)) == product_each(shape(cta_v_map)), - "TMA requires CTA_Tile and SLayout top-level shape equivalence."); + // CUTE_STATIC_ASSERT_V(product_each(shape(slayout)) == product_each(shape(cta_v_map)), + // "TMA requires CTA_Tile and SLayout top-level shape equivalence."); + CUTE_STATIC_ASSERT_V(size(slayout) == size(cta_v_map), + "TMA requires CTA_Tile and SLayout top-level size equivalence."); #if 0 print("gtensor : "); print(gtensor); print("\n"); @@ -982,8 +994,10 @@ make_tma_copy_desc(Tensor const& gtensor, // The origin CUtensorMapFloatOOBfill tma_oobFill = CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE; // TMA smem swizzle type - CUtensorMapSwizzle smem_swizzle = TMA::to_CUtensorMapSwizzle(get_tma_swizzle_bits(swizzle)); - CUresult result = cuTensorMapEncodeTiled( + TMA::SmemSwizzleBits swizzle_bits = get_tma_swizzle_bits(swizzle); + TMA::SmemSwizzleBase swizzle_base = get_tma_swizzle_base(swizzle); + CUtensorMapSwizzle smem_swizzle = TMA::to_CUtensorMapSwizzle(swizzle_bits, swizzle_base); + CUresult result = CUTLASS_CUDA_DRIVER_WRAPPER_CALL(cuTensorMapEncodeTiled)( &tma_desc, tma_format, tma_dim, diff --git a/include/cute/atom/copy_traits_sm90_tma_swizzle.hpp b/include/cute/atom/copy_traits_sm90_tma_swizzle.hpp index 73ced00cb7..47dcb6c7d0 100644 --- a/include/cute/atom/copy_traits_sm90_tma_swizzle.hpp +++ b/include/cute/atom/copy_traits_sm90_tma_swizzle.hpp @@ -68,4 +68,26 @@ get_tma_swizzle_bits(Layout const& layout) return get_tma_swizzle_bits(get_swizzle_portion(layout)); } +template +CUTE_HOST_DEVICE constexpr +TMA::SmemSwizzleBase +get_tma_swizzle_base(Swizzle) +{ + if constexpr (M == 4) { + static_assert(0 <= B && B <= 3, "Expected B = 0,1,2, or 3 when M == 4. Unsupported layout swizzle."); + static_assert(S == 3, "Expected S = 3 when M == 4. 
Unsupported layout swizzle.");
+    return TMA::SmemSwizzleBase::SWIZZLE_BASE_16B;
+  }
+  else {
+    static_assert(M == 4, "Expected 128b=16B=(2^4)B base swizzle.");
+  }
+}
+
+template <class Layout>
+CUTE_HOST_DEVICE constexpr
+TMA::SmemSwizzleBase
+get_tma_swizzle_base(Layout const& layout)
+{
+  return get_tma_swizzle_base(get_swizzle_portion(layout));
+}
+
 } // namespace cute::detail
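`get_tma_swizzle_base` complements the existing `get_tma_swizzle_bits`, and the pair feeds the two-argument `TMA::to_CUtensorMapSwizzle` used by `make_im2col_tma_copy_desc` and `make_tma_copy_desc` above. A rough host-side sketch of the query chain (the layout atom and includes are illustrative assumptions, not part of this patch):

```cpp
#include <cute/atom/copy_traits_sm90_tma.hpp>  // swizzle helpers + TMA enums (assumed transitive)
#include <cute/atom/mma_traits_sm90_gmma.hpp>  // GMMA::Layout_K_SW128_Atom

void swizzle_query_sketch() {
  using namespace cute;
  // A 128B-swizzled shared-memory layout atom; any swizzled layout works alike.
  auto slayout = GMMA::Layout_K_SW128_Atom<half_t>{};
  auto bits = detail::get_tma_swizzle_bits(slayout);  // e.g. SmemSwizzleBits::B128
  auto base = detail::get_tma_swizzle_base(slayout);  // SWIZZLE_BASE_16B when M == 4
  CUtensorMapSwizzle sw = TMA::to_CUtensorMapSwizzle(bits, base);
  (void)sw;  // written into the TMA descriptor by the encode call
}
```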
printf("\\draw[color=black,thick,shift={(-0.5,-0.5)}] (%d,%d) grid (%d,%d);\n\n", + 0, 0, int(size<0>(C)), int(size<1>(C))); + + // A starting at 0,-size<1>(A)-1 + for (int m = 0; m < size<0>(A); ++m) { + for (int k = 0; k < size<1>(A); ++k) { + int thrid = A(m,k) % size(TA); + int val_idx = A(m,k) / size(TA); + int thr_idx = TA(thrid); + + printf("\\node[fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", + color(thr_idx, val_idx), + m, k-1-size<1>(A), + thr_idx, val_idx); + } + } + // Grid + printf("\\draw[color=black,thick,shift={(-0.5,-0.5)}] (%d,%d) grid (%d,%d);\n\n", + 0, int(-size<1>(A)-1), int(size<0>(A)), -1); + // A labels + for (int m = 0, k = -1; m < size<0>(A); ++m) { + printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", m, k-1-size<1>(A), m); + } + for (int m = -1, k = 0; k < size<1>(A); ++k) { + printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", m, k-1-size<1>(A), k); + } + + // B starting at -size<1>(B)-1,0 + for (int n = 0; n < size<0>(B); ++n) { + for (int k = 0; k < size<1>(B); ++k) { + int thrid = B(n,k) % size(TB); + int val_idx = B(n,k) / size(TB); + int thr_idx = TB(thrid); + + printf("\\node[fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", + color(thr_idx, val_idx), + k-1-size<1>(B), n, + thr_idx, val_idx); + } + } + // Grid + printf("\\draw[color=black,thick,shift={(-0.5,-0.5)}] (%d,%d) grid (%d,%d);\n\n", + int(-size<1>(B)-1), 0, -1, int(size<0>(B))); + // B labels + for (int n = 0, k = -1; n < size<0>(B); ++n) { + printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", k-1-size<1>(B), n, n); + } + for (int n = -1, k = 0; k < size<1>(B); ++k) { + printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", k-1-size<1>(B), n, k); + } + + // Footer + printf("\\end{tikzpicture}\n" + "\\end{document}\n"); +} + // MNK MMA Layout to console printer template (tid,vid) and printf("+\n"); } -// MNK MMA Layout to Latex TIKZ -- 8-value color coded by thread +// MNK MMA Layout to SVG -- 8-value color coded by thread template CUTE_HOST_DEVICE void -print_latex_mma(LayoutC const& C, ThrIDC const& TC, // (m,n) -> (tid,vid) and tid -> thr_idx - LayoutA const& A, ThrIDA const& TA, // (m,k) -> (tid,vid) and tid -> thr_idx - LayoutB const& B, ThrIDB const& TB) // (n,k) -> (tid,vid) and tid -> thr_idx +print_svg_mma(LayoutC const& C, ThrIDC const& TC, // (m,n) -> (tid,vid) and tid -> thr_idx + LayoutA const& A, ThrIDA const& TA, // (m,k) -> (tid,vid) and tid -> thr_idx + LayoutB const& B, ThrIDB const& TB) // (n,k) -> (tid,vid) and tid -> thr_idx { - CUTE_STATIC_ASSERT_V(rank(C) == Int<2>{}); - CUTE_STATIC_ASSERT_V(rank(A) == Int<2>{}); - CUTE_STATIC_ASSERT_V(rank(B) == Int<2>{}); + char const *color_map[8] = {"175,175,255", "175,255,175", "255,255,175", + "255,175,175", "210,210,255", "210,255,210", + "255,255,210", "255,210,210"}; + + const int cell_width = 20; + const int cell_height = 20; + + const int page_width = (size<1>(A) + size<0>(B) + 2) * cell_width; + const int page_height = (size<1>(B) + size<0>(A) + 2) * cell_height; + + // header + printf("\n", + page_width, page_height); + + // C + int c_base_x = (size<1>(A) + 2) * cell_width; + int c_base_y = (size<1>(B) + 2) * cell_height; + for (int m = 0; m < cute::size<0>(C); ++m) { + for (int n = 0; n < cute::size<1>(C); ++n) { + + int thrid = C(m, n) % size(TC); + int val_idx = C(m, n) / size(TC); + int thr_idx = TC(thrid); - assert(size<0>(A) == size<0>(C)); - assert(size<0>(B) == size<1>(C)); - assert(size<1>(A) == size<1>(B)); + int x = n * cell_width + c_base_x; + int y = m * cell_height + c_base_y; - char const* 
latex_header = - "\\documentclass{standalone}\n" - "\\usepackage{tikz}\n" - "\\usetikzlibrary{external}\n" - "\\tikzexternalize\n" - "\\begin{document}\n" - "\\begin{tikzpicture}[x={(0cm,-1cm)},y={(1cm,0cm)},box/.style={rectangle,draw=black,thick,minimum size=1cm,anchor=center}]\n\n"; - char const* latex_footer = - "\\end{tikzpicture}\n" - "\\end{document}\n"; - - char const* color_map[8] = {"{rgb,255:red,175;green,175;blue,255}", - "{rgb,255:red,175;green,255;blue,175}", - "{rgb,255:red,255;green,255;blue,175}", - "{rgb,255:red,255;green,175;blue,175}", - "{rgb,255:red,210;green,210;blue,255}", - "{rgb,255:red,210;green,255;blue,210}", - "{rgb,255:red,255;green,255;blue,210}", - "{rgb,255:red,255;green,210;blue,210}"}; + int thr_x = x + cell_width / 2; + int thr_y = y + cell_height / 4; + int val_x = x + cell_width / 2; + int val_y = y + cell_height * 3 / 4; - // Header - printf("%% LayoutC: "); print(C); printf("\n"); - printf("%% ThrIDC : "); print(TC); printf("\n"); - printf("%% LayoutA: "); print(A); printf("\n"); - printf("%% ThrIDA : "); print(TA); printf("\n"); - printf("%% LayoutB: "); print(B); printf("\n"); - printf("%% ThrIDB : "); print(TB); printf("\n\n"); - - printf(latex_header); + printf("\n", + x, y, cell_width, cell_height, color_map[thr_idx % 8]); - // C starting at 0,0 - for (int m = 0; m < size<0>(C); ++m) { - for (int n = 0; n < size<1>(C); ++n) { - int thrid = C(m,n) % size(TC); - int val_idx = C(m,n) / size(TC); - int thr_idx = TC(thrid); - - printf("\\node[box,fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", - color_map[thr_idx % 8], - m, n, - thr_idx, val_idx); + printf("T%d\n", + thr_x, thr_y, thr_idx); + printf("V%d\n", + val_x, val_y, val_idx); } } - // A starting at 0,-size<1>(A)-1 + // A + int a_base_x = cell_width; + int a_base_y = (size<1>(B) + 2) * cell_height; for (int m = 0; m < size<0>(A); ++m) { for (int k = 0; k < size<1>(A); ++k) { - int thrid = A(m,k) % size(TA); - int val_idx = A(m,k) / size(TA); + int thrid = A(m, k) % size(TA); + int val_idx = A(m, k) / size(TA); int thr_idx = TA(thrid); - printf("\\node[box,fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", - color_map[thr_idx % 8], - m, k-1-size<1>(A), - thr_idx, val_idx); + int x = k * cell_width + a_base_x; + int y = m * cell_height + a_base_y; + + int thr_x = x + cell_width / 2; + int thr_y = y + cell_height / 4; + int val_x = x + cell_width / 2; + int val_y = y + cell_height * 3 / 4; + + printf("\n", + x, y, cell_width, cell_height, color_map[thr_idx % 8]); + printf("T%d\n", + thr_x, thr_y, thr_idx); + printf("V%d\n", + val_x, val_y, val_idx); } } - // B starting at -size<1>(B)-1,0 + // B + int b_base_x = (size<1>(A) + 2) * cell_width; + int b_base_y = cell_height; for (int n = 0; n < size<0>(B); ++n) { for (int k = 0; k < size<1>(B); ++k) { - int thrid = B(n,k) % size(TB); - int val_idx = B(n,k) / size(TB); + int thrid = B(n, k) % size(TB); + int val_idx = B(n, k) / size(TB); int thr_idx = TB(thrid); - printf("\\node[box,fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", - color_map[thr_idx % 8], - k-1-size<1>(B), n, - thr_idx, val_idx); + int x = n * cell_width + b_base_x; + int y = k * cell_height + b_base_y; + + int thr_x = x + cell_width / 2; + int thr_y = y + cell_height / 4; + int val_x = x + cell_width / 2; + int val_y = y + cell_height * 3 / 4; + + printf("\n", + x, y, cell_width, cell_height, color_map[thr_idx % 8]); + printf("T%d\n", + thr_x, thr_y, thr_idx); + printf("V%d\n", + val_x, val_y, val_idx); } } // A labels - for (int m = 0, k = -1; m < size<0>(A); ++m) { - 
printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", m, k-1-size<1>(A), m); + for (int m = 0; m < size<0>(A); ++m) { + int x = cell_width / 2; + int y = m * cell_height + cell_height / 2 + a_base_y; + printf("%d\n", + x, y, m); } - for (int k = 0, m = -1; k < size<1>(A); ++k) { - printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", m, k-1-size<1>(A), k); + for (int k = 0; k < size<1>(A); ++k) { + int x = cell_width + k * cell_width + cell_width / 2; + int y = -cell_height / 2 + a_base_y; + printf("%d\n", + x, y, k); } + // B labels - for (int n = 0, k = -1; n < size<0>(B); ++n) { - printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", k-1-size<1>(B), n, n); + for (int n = 0; n < size<0>(B); ++n) { + int x = b_base_x + cell_width * n + cell_width / 2; + int y = cell_height / 2; + printf("%d\n", + x, y, n); } - for (int k = 0, n = -1; k < size<1>(B); ++k) { - printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", k-1-size<1>(B), n, k); + for (int k = 0; k < size<1>(B); ++k) { + int x = b_base_x - cell_width / 2; + int y = cell_height * (k + 1) + cell_height / 2; + printf("%d\n", + x, y, k); } - // Footer - printf(latex_footer); + // footer + printf(""); +} + +template +CUTE_HOST_DEVICE +void +print_svg(MMA_Atom const &mma_atom) { + print_svg(make_tiled_mma(mma_atom)); +} + +template +CUTE_HOST_DEVICE +void +print_svg(TiledMMA const &mma) { + auto layout_and_thrid_C = mma.get_layoutC_MN(); + auto layoutC_MN = get<0>(layout_and_thrid_C); + auto thrID_C = get<1>(layout_and_thrid_C); + + auto layout_and_thrid_A = mma.get_layoutA_MK(); + auto layoutA_MK = get<0>(layout_and_thrid_A); + auto thrID_A = get<1>(layout_and_thrid_A); + + auto layout_and_thrid_B = mma.get_layoutB_NK(); + auto layoutB_NK = get<0>(layout_and_thrid_B); + auto thrID_B = get<1>(layout_and_thrid_B); + + print_svg_mma(layoutC_MN, thrID_C, layoutA_MK, thrID_A, layoutB_NK, thrID_B); } } // namespace cute diff --git a/include/cute/atom/mma_traits.hpp b/include/cute/atom/mma_traits.hpp index 55e3edeb50..b5569fc2f5 100644 --- a/include/cute/atom/mma_traits.hpp +++ b/include/cute/atom/mma_traits.hpp @@ -30,23 +30,14 @@ **************************************************************************************************/ #pragma once -#include - -#include +#include // cute::Tensor +#include // cute::is_rmem +#include // cute::UniversalFMA +#include // cute::detail::explode namespace cute { -namespace detail { - -template -struct supports_output_scaling { static constexpr bool value = false; }; - -template -struct supports_output_scaling().accumulate_)>> { static constexpr bool value = true; }; - -} // end namespace detail - /** * concept MMA_Traits * { @@ -99,17 +90,27 @@ struct MMA_Traits> using CLayout = Layout>; }; +// Extract an MMA_Op from an MMA_Traits +template +struct MMA_Op {}; + +template +struct MMA_Op> { + using type = MMA_Op_Arg; +}; + // // Generic mma_unpack for any MMA_Traits // -template CUTE_HOST_DEVICE constexpr void -mma_unpack(MMA_Traits const& traits, +mma_unpack(AnyMMATraits const& traits, Tensor & D, Tensor const& A, Tensor const& B, @@ -121,115 +122,54 @@ mma_unpack(MMA_Traits const& traits, static_assert(is_rmem::value, "Expected registers in MMA_Atom::call"); // Register value types from the MMA_Operation register arrays + using MMA_Op = typename MMA_Op::type; using RegTypeD = typename remove_extent::type; using RegTypeA = typename remove_extent::type; using RegTypeB = typename remove_extent::type; using RegTypeC = typename remove_extent::type; - using MMATraits = MMA_Traits; - [[maybe_unused]] constexpr 
+  Tensor rA = recast<RegTypeA>(A);
+  Tensor rB = recast<RegTypeB>(B);
+  Tensor rD = recast<RegTypeD>(D);
+  Tensor rC = recast<RegTypeC>(C);
+
+  constexpr int RegNumD = extent<typename MMA_Op::DRegisters>::value;
   constexpr int RegNumA = extent<typename MMA_Op::ARegisters>::value;
   constexpr int RegNumB = extent<typename MMA_Op::BRegisters>::value;
   constexpr int RegNumC = extent<typename MMA_Op::CRegisters>::value;
 
-  Tensor rA = recast<RegTypeA>(A);
-  Tensor rB = recast<RegTypeB>(B);
-
   CUTE_STATIC_ASSERT_V(size(rA) == Int<RegNumA>{});
   CUTE_STATIC_ASSERT_V(size(rB) == Int<RegNumB>{});
+  CUTE_STATIC_ASSERT_V(size(rD) == Int<RegNumD>{});
+  CUTE_STATIC_ASSERT_V(size(rC) == Int<RegNumC>{});
 
-  if constexpr (is_same<RegTypeD, void>::value)
-  {
-    static_assert(is_same<typename TD::value_type, typename TC::value_type>::value, "GMMA C and D value_type must match.");
-    static_assert(is_same<DLayout, CLayout>::value, "GMMA C and D layouts must match.");
-    // assert((void*)&C == (void*)&D);
-
-    Tensor rC = recast<RegTypeC>(D);  // NOTE: D and C are same, so use mutable D
-
-    //CUTE_STATIC_ASSERT_V(size(rC) == Int<RegNumC>{});
-
-    if constexpr (detail::supports_output_scaling<MMATraits>::value) {
-#if defined(CUTLASS_ENABLE_SYCL)
-      detail::explode_mma(rA, make_int_sequence<RegNumA>{},
-                          rB, make_int_sequence<RegNumB>{},
-                          rC, make_int_sequence<RegNumC>{},
-                          &(traits.accumulate_), seq<0>{});
-#else
-      detail::explode(MMA_Op::fma,
-                      rA, make_int_sequence<RegNumA>{},
-                      rB, make_int_sequence<RegNumB>{},
-                      rC, make_int_sequence<RegNumC>{},
-                      &(traits.accumulate_), seq<0>{});
-#endif
-    }
-    else {
-#if defined(CUTLASS_ENABLE_SYCL)
-      detail::explode_mma(rA, make_int_sequence<RegNumA>{},
-                          rB, make_int_sequence<RegNumB>{},
-                          rC, make_int_sequence<RegNumC>{});
-#else
-      detail::explode(MMA_Op::fma,
-                      rA, make_int_sequence<RegNumA>{},
-                      rB, make_int_sequence<RegNumB>{},
-                      rC, make_int_sequence<RegNumC>{});
-#endif
-    }
-  }
-  else {
-    Tensor rD = recast<RegTypeD>(D);
-    Tensor rC = recast<RegTypeC>(C);
-
-    CUTE_STATIC_ASSERT_V(size(rD) == Int<RegNumD>{});
-    CUTE_STATIC_ASSERT_V(size(rC) == Int<RegNumC>{});
-    if constexpr (detail::supports_output_scaling<MMATraits>::value) {
 #if defined(CUTLASS_ENABLE_SYCL)
-      detail::explode_mma(rD, make_int_sequence<RegNumD>{},
-                          rA, make_int_sequence<RegNumA>{},
-                          rB, make_int_sequence<RegNumB>{},
-                          rC, make_int_sequence<RegNumC>{},
-                          &(traits.accumulate_), seq<0>{});
+  detail::explode_mma(rD, make_int_sequence<RegNumD>{},
+                      rA, make_int_sequence<RegNumA>{},
+                      rB, make_int_sequence<RegNumB>{},
+                      rC, make_int_sequence<RegNumC>{});
#else
-      detail::explode(MMA_Op::fma,
-                      rD, make_int_sequence<RegNumD>{},
-                      rA, make_int_sequence<RegNumA>{},
-                      rB, make_int_sequence<RegNumB>{},
-                      rC, make_int_sequence<RegNumC>{},
-                      &(traits.accumulate_), seq<0>{});
+  detail::explode(MMA_Op::fma,
+                  rD, make_int_sequence<RegNumD>{},
+                  rA, make_int_sequence<RegNumA>{},
+                  rB, make_int_sequence<RegNumB>{},
+                  rC, make_int_sequence<RegNumC>{});
 #endif
-    }
-    else {
-#if defined(CUTLASS_ENABLE_SYCL)
-      detail::explode_mma(rD, make_int_sequence<RegNumD>{},
-                          rA, make_int_sequence<RegNumA>{},
-                          rB, make_int_sequence<RegNumB>{},
-                          rC, make_int_sequence<RegNumC>{});
-#else
-      detail::explode(MMA_Op::fma,
-                      rD, make_int_sequence<RegNumD>{},
-                      rA, make_int_sequence<RegNumA>{},
-                      rB, make_int_sequence<RegNumB>{},
-                      rC, make_int_sequence<RegNumC>{});
-#endif
-    }
-  }
 }
 
-//
 // Accept mutable temporaries
-//
-
-template <class MMA_Op, class... MMA_Args,
+template <class AnyMMATraits,
           class TD, class DLayout,
           class TA, class ALayout,
           class TB, class BLayout,
           class TC, class CLayout>
 CUTE_HOST_DEVICE constexpr
 void
-mma_unpack(MMA_Traits<MMA_Op, MMA_Args...> const& traits,
-           Tensor<TD, DLayout>     && D,
-           Tensor<TA, ALayout> const& A,
-           Tensor<TB, BLayout> const& B,
-           Tensor<TC, CLayout> const& C)
+mma_unpack(AnyMMATraits const& traits,
+           Tensor<TD, DLayout>     && D,
+           Tensor<TA, ALayout> const& A,
+           Tensor<TB, BLayout> const& B,
+           Tensor<TC, CLayout> const& C)
 {
   mma_unpack(traits, D, A, B, C);
 }
diff --git a/include/cute/atom/mma_traits_sm80.hpp b/include/cute/atom/mma_traits_sm80.hpp
index ab4028811b..706b10d889 100644
--- a/include/cute/atom/mma_traits_sm80.hpp
+++ b/include/cute/atom/mma_traits_sm80.hpp
@@ -433,10 +433,57 @@ struct MMA_Traits
   using Shape_MNK = Shape<_16,_8,_256>;
   using ThrID   = Layout<_32>;
-  using ALayout = Layout>,
-                  Stride<_64,Stride<_64,_16,_8,_2048>>>;
-  using BLayout = Layout>,
-                  Stride<_32,Stride< _1,_1024>>>;
+  using ALayout = Layout,Shape<_32,_2,_2>>, +
Stride,Stride<_16,_8,_2048>>>; + using BLayout = Layout,Shape<_32,_2>>, + Stride,Stride< _8,_1024>>>; using CLayout = SM80_16x8_Row; }; + +template <> +struct MMA_Traits + :MMA_Traits {}; + +template<> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = cute::uint1b_t; + using ValTypeB = cute::uint1b_t; + using ValTypeC = int32_t; + + using Shape_MNK = Shape<_8,_8,_128>; + using ThrID = Layout<_32>; + using ALayout = Layout,_32>, + Stride,_8>>; + using BLayout = Layout,_32>, + Stride,_8>>; + using CLayout = SM80_8x8_Row; +}; + +template <> +struct MMA_Traits + :MMA_Traits {}; + +template<> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = cute::uint1b_t; + using ValTypeB = cute::uint1b_t; + using ValTypeC = int32_t; + + using Shape_MNK = Shape<_16,_8,_128>; + using ThrID = Layout<_32>; + using ALayout = Layout,Shape<_32,_2>>, + Stride,Stride>>>; + using BLayout = Layout,_32>, + Stride,_8>>; + using CLayout = SM80_16x8_Row; +}; + +template <> +struct MMA_Traits + :MMA_Traits {}; + } // end namespace cute diff --git a/include/cute/atom/mma_traits_sm90.hpp b/include/cute/atom/mma_traits_sm90.hpp index 437af27b21..b2ced3f878 100644 --- a/include/cute/atom/mma_traits_sm90.hpp +++ b/include/cute/atom/mma_traits_sm90.hpp @@ -41,6 +41,8 @@ namespace cute { //////////////////////// fp64 = fp64 * fp64 + fp64 //////////////////////////// /////////////////////////////////////////////////////////////////////////////// +using SM90_16x8x4_F64F64F64F64_TN = SM90::MMA_16x8x4_F64F64F64F64_TN; + template <> struct MMA_Traits { @@ -59,6 +61,8 @@ struct MMA_Traits Stride,Stride<_16,_8>>>; }; +using SM90_16x8x8_F64F64F64F64_TN = SM90::MMA_16x8x8_F64F64F64F64_TN; + template <> struct MMA_Traits { @@ -77,6 +81,8 @@ struct MMA_Traits Stride,Stride<_16,_8>>>; }; +using SM90_16x8x16_F64F64F64F64_TN = SM90::MMA_16x8x16_F64F64F64F64_TN; + template <> struct MMA_Traits { @@ -99,9 +105,11 @@ struct MMA_Traits //////////////////////// cfp64 = cfp64 * cfp64 + cfp64 //////////////////////////// /////////////////////////////////////////////////////////////////////////////////// +using SM90_16x8x4_C64C64C64C64_TN = SM90::MMA_16x8x4_C64C64C64C64_TN; + template <> struct MMA_Traits - : MMA_Traits + : MMA_Traits { using ValTypeD = complex; using ValTypeA = complex; @@ -109,9 +117,11 @@ struct MMA_Traits using ValTypeC = complex; }; +using SM90_16x8x8_C64C64C64C64_TN = SM90::MMA_16x8x8_C64C64C64C64_TN; + template <> struct MMA_Traits - : MMA_Traits + : MMA_Traits { using ValTypeD = complex; using ValTypeA = complex; @@ -119,9 +129,11 @@ struct MMA_Traits using ValTypeC = complex; }; +using SM90_16x8x16_C64C64C64C64_TN = SM90::MMA_16x8x16_C64C64C64C64_TN; + template <> struct MMA_Traits - : MMA_Traits + : MMA_Traits { using ValTypeD = complex; using ValTypeA = complex; diff --git a/include/cute/atom/mma_traits_sm90_gmma.hpp b/include/cute/atom/mma_traits_sm90_gmma.hpp index 3a4fdfa1a5..b02f5b3afd 100644 --- a/include/cute/atom/mma_traits_sm90_gmma.hpp +++ b/include/cute/atom/mma_traits_sm90_gmma.hpp @@ -30,10 +30,15 @@ **************************************************************************************************/ #pragma once -#include -#include - -#include +#include // cute::smem_ptr_flag +#include // cute::smem_sparse_ptr_flag +#include // cute::Swizzle +#include // cute::Tensor +#include // cute::LayoutType +#include // cute::SM90_64x8x16_F16F16F16_SS, etc +#include // cute::MMA_Traits +#include // cute::ComposedLayout +#include // cute::is_static namespace cute { @@ -60,7 +65,7 @@ 
warpgroup_fence_operand(Tensor& frg) { } } -namespace GMMA { +namespace SM90::GMMA { /////////////////////////////////////////// // Common layouts for GMMA Shared Memory // @@ -99,20 +104,20 @@ template using Layout_K_SW128_Atom = decltype(upcast::value>(Layout_K_SW128_Atom_Bits{})); // With GMMA::Major param -template -using Layout_INTER_Atom = typename conditional +using Layout_INTER_Atom = typename conditional, Layout_K_INTER_Atom>::type; -template -using Layout_SW32_Atom = typename conditional +using Layout_SW32_Atom = typename conditional, Layout_K_SW32_Atom>::type; -template -using Layout_SW64_Atom = typename conditional +using Layout_SW64_Atom = typename conditional, Layout_K_SW64_Atom>::type; -template -using Layout_SW128_Atom = typename conditional +using Layout_SW128_Atom = typename conditional, Layout_K_SW128_Atom>::type; @@ -188,7 +193,7 @@ layout_type(Tensor> const&) * auto smem_layout = tile_to_shape(Layout_K_SW128_Atom{}, Shape<_128,_64>{}); * is guaranteed to be accepted by make_gmma_desc for appropriate value_type. */ -template +template CUTE_HOST_DEVICE constexpr GmmaDescriptor make_gmma_desc(Tensor const& tensor) @@ -203,7 +208,7 @@ make_gmma_desc(Tensor const& tensor) GmmaDescriptor desc; // Layout type - constexpr GMMA::LayoutType LAYOUT_TYPE = GMMA::layout_type(u128_tensor); + constexpr LayoutType LAYOUT_TYPE = layout_type(u128_tensor); desc.bitfield.layout_type_ = uint8_t(LAYOUT_TYPE); // Start address (4LSB not included) @@ -214,12 +219,12 @@ make_gmma_desc(Tensor const& tensor) desc.bitfield.base_offset_ = base_offset; // LayoutType meta - constexpr int W = LAYOUT_TYPE == GMMA::LayoutType::INTERLEAVE ? 1 : - LAYOUT_TYPE == GMMA::LayoutType::B32 ? 2 : - LAYOUT_TYPE == GMMA::LayoutType::B64 ? 4 : - LAYOUT_TYPE == GMMA::LayoutType::B128 ? 8 : -1; + constexpr int W = LAYOUT_TYPE == LayoutType::INTERLEAVE ? 1 : + LAYOUT_TYPE == LayoutType::B32 ? 2 : + LAYOUT_TYPE == LayoutType::B64 ? 4 : + LAYOUT_TYPE == LayoutType::B128 ? 8 : -1; - if constexpr (MajorMode == GMMA::Major::MN) + if constexpr (MajorMode == Major::MN) { /* In units of uint128_t, each GmmaDescriptor Major-MN describes a canonical layout of the form * @@ -228,8 +233,10 @@ make_gmma_desc(Tensor const& tensor) * LayoutType::B64 : Swizzle<2,4,3> o smem_ptr o ((4,n),(8,k)):((1,LBO),(4,SBO)) * LayoutType::B128 : Swizzle<3,4,3> o smem_ptr o ((8,n),(8,k)):((1,LBO),(8,SBO)) */ - static_assert(size<1>(u128_tensor) == Int<(256 / cute::sizeof_bits::value)>{}, // K size - "Not a canonical GMMA_MN Layout: Expected K-size 256/sizeof_bits."); + static_assert(size<1>(u128_tensor) == Int<(256 / cute::sizeof_bits::value)>{} || // A and B in dense MMA + size<1>(u128_tensor) == Int<(128 / cute::sizeof_bits::value)>{} || // A in sparse MMA + size<1>(u128_tensor) == Int<(512 / cute::sizeof_bits::value)>{}, // B in sparse MMA + "Not a canonical GMMA_MN Layout: Expected K-size 256/sizeof_bits for dense or (128|512)/sizeof_bits for sparse."); // Construct the canonical GMMA T Layout with shape ((W,n),(8,2)) Layout canonical_layout = logical_divide(layout(u128_tensor), make_tile(Layout,_1>{}, Layout,_1>{})); @@ -239,7 +246,7 @@ make_gmma_desc(Tensor const& tensor) CUTE_STATIC_ASSERT_V(rank<1>(canonical_layout) == Int<2>{}, "Not a canonical GMMA_MN Layout: No flat offset mode"); // Check canonical mode strides constexpr uint32_t stride_00 = stride<0,0>(canonical_layout); - constexpr uint32_t expected_stride_00 = LAYOUT_TYPE == GMMA::LayoutType::INTERLEAVE ? 
stride<0,0>(canonical_layout) : 1; + constexpr uint32_t expected_stride_00 = LAYOUT_TYPE == LayoutType::INTERLEAVE ? stride<0,0>(canonical_layout) : 1; static_assert(stride_00 == expected_stride_00, "Not a canonical GMMA_MN Layout: Expected stride failure."); constexpr uint32_t stride_10 = stride<1,0>(canonical_layout); constexpr uint32_t expected_stride_10 = W; @@ -249,10 +256,10 @@ make_gmma_desc(Tensor const& tensor) constexpr uint32_t stride_01 = stride<0,1>(canonical_layout); constexpr uint32_t stride_11 = stride<1,1>(canonical_layout); - desc.bitfield.stride_byte_offset_ = (LAYOUT_TYPE == GMMA::LayoutType::INTERLEAVE) ? stride_01 : stride_11; - desc.bitfield.leading_byte_offset_ = (LAYOUT_TYPE == GMMA::LayoutType::INTERLEAVE) ? stride_11 : stride_01; + desc.bitfield.stride_byte_offset_ = (LAYOUT_TYPE == LayoutType::INTERLEAVE) ? stride_01 : stride_11; + desc.bitfield.leading_byte_offset_ = (LAYOUT_TYPE == LayoutType::INTERLEAVE) ? stride_11 : stride_01; } - else if constexpr (MajorMode == GMMA::Major::K) + else if constexpr (MajorMode == Major::K) { /* In units of uint128_t, each GmmaDescriptor Major-K describes a canonical layout of the form * @@ -263,8 +270,8 @@ make_gmma_desc(Tensor const& tensor) */ CUTE_STATIC_ASSERT_V(size<0>(u128_tensor) % Int<8>{} == Int<0>{}, // N|M size "Not a canonical GMMA_K Layout: Expected MN-size multiple of 8."); - CUTE_STATIC_ASSERT_V(size<1>(u128_tensor) == Int<2>{}, // K size - "Not a canonical GMMA_K Layout: Expected K-size 2 (in units of uint128_t)."); + CUTE_STATIC_ASSERT_V(size<1>(u128_tensor) == Int<2>{} || size<1>(u128_tensor) == Int<4>{}, // K size + "Not a canonical GMMA_K Layout: Expected K-size 2 for dense or 4 for sparse (in units of uint128_t)."); // Construct the canonical GMMA N Layout with shape ((8,n),(2,1)) Layout canonical_layout = logical_divide(layout(u128_tensor), make_tile(Layout<_8,_1>{}, Layout<_2,_1>{})); @@ -277,7 +284,7 @@ make_gmma_desc(Tensor const& tensor) constexpr uint32_t expected_stride_00 = W; static_assert(stride_00 == expected_stride_00, "Not a canonical GMMA_K Layout: Expected stride failure."); constexpr uint32_t stride_10 = stride<1,0>(canonical_layout); - constexpr uint32_t expected_stride_10 = (LAYOUT_TYPE == GMMA::LayoutType::INTERLEAVE) ? stride<1,0>(canonical_layout) : 1; + constexpr uint32_t expected_stride_10 = (LAYOUT_TYPE == LayoutType::INTERLEAVE) ? stride<1,0>(canonical_layout) : 1; static_assert(stride_10 == expected_stride_10, "Not a canonical GMMA_K Layout: Expected stride failure."); // stride dimension byte offset and leading dimension byte offset (4LSB not included == uint128_t units) @@ -286,7 +293,7 @@ make_gmma_desc(Tensor const& tensor) desc.bitfield.stride_byte_offset_ = stride_01; desc.bitfield.leading_byte_offset_ = stride_10; } else { - static_assert(MajorMode != GMMA::Major::MN && MajorMode != GMMA::Major::K, "Unrecognized MajorMode!"); + static_assert(MajorMode != Major::MN && MajorMode != Major::K, "Unrecognized MajorMode!"); } #if 0 @@ -357,21 +364,21 @@ print(DescriptorIterator) { // The GMMA Traits below have custom fragment type flags for their smem desc tensors. // These flags specialize a MakeTensor customization point to correctly make the fragment that is desired. 
-template +template struct smem_desc : DescriptorIterator {}; -} // end namespace GMMA +} // end namespace SM90::GMMA // Customization point for creating a GMMA::smem_desc Tensor -template -struct MakeTensor> +template +struct MakeTensor> { template CUTE_HOST_DEVICE constexpr auto operator()(Tensor const& smem_tensor) { static_assert(is_smem::value, "Expected SMEM Tensor to construct a GMMA Desc Tensor"); - return make_tensor(GMMA::DescriptorIterator{GMMA::make_gmma_desc(tensor<0>(smem_tensor))}, + return make_tensor(SM90::GMMA::DescriptorIterator{SM90::GMMA::make_gmma_desc(tensor<0>(smem_tensor))}, replace<0>(recast(smem_tensor).layout(), Layout<_1,_0>{})); } }; @@ -380,74 +387,105 @@ struct MakeTensor> //////////////////////////// MMA_TRAITS /////////////////////////////////////// /////////////////////////////////////////////////////////////////////////////// -namespace GMMA { - -// Accumulator layouts -using CLayout_64x8 = Layout,Shape < _2,_2>>, - Stride,Stride<_64,_8>>>; - -using CLayout_64x16 = Layout,Shape < _2,_2, _2>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x32 = Layout,Shape < _2,_2, _4>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x48 = Layout,Shape < _2,_2, _6>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x64 = Layout,Shape < _2,_2, _8>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x80 = Layout,Shape < _2,_2, _10>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x96 = Layout,Shape < _2,_2, _12>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x112 = Layout,Shape < _2,_2, Int<14>>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x128 = Layout,Shape < _2,_2, _16>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x144 = Layout,Shape < _2,_2, Int<18>>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x160 = Layout,Shape < _2,_2, Int<20>>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x176 = Layout,Shape < _2,_2, Int<22>>>, - Stride,Stride<_64,_8,_512>>>; - -using CLayout_64x192 = Layout,Shape < _2,_2, _24>>, - Stride,Stride<_64,_8,_512>>>; +namespace SM90::GMMA { -using CLayout_64x224 = Layout,Shape < _2,_2, Int<28>>>, - Stride,Stride<_64,_8,_512>>>; +// +// Specialized mma_unpack implementation for SM90 GMMA instructions +// -using CLayout_64x240 = Layout,Shape < _2,_2, Int<30>>>, - Stride,Stride<_64,_8,_512>>>; +template +CUTE_HOST_DEVICE constexpr +void +mma_unpack(MMA_Traits const& traits, + Tensor & D, + Tensor const& A, + Tensor const& B, + Tensor const& C) +{ + static_assert(is_rmem::value, "Expected registers in MMA_Atom::call"); + static_assert(is_rmem::value, "Expected registers in MMA_Atom::call"); + static_assert(is_rmem::value, "Expected registers in MMA_Atom::call"); + static_assert(is_rmem::value, "Expected registers in MMA_Atom::call"); + + // Register value types from the MMA_Operation register arrays + using RegTypeA = typename remove_extent::type; + using RegTypeB = typename remove_extent::type; + using RegTypeC = typename remove_extent::type; + + // SM90 GMMA take three arguments rather than four, try to assert C and D are aliased + static_assert(is_same::value, "GMMA C and D value_type must match."); + static_assert(is_same::value, "GMMA C and D layouts must match."); + // assert((void*)&C == (void*)&D); + + Tensor rA = recast(A); + Tensor rB = recast(B); + Tensor rC = recast(D); // NOTE: D and C are same, so use mutable D + + constexpr int RegNumA = extent::value; + constexpr int RegNumB = extent::value; + constexpr int RegNumC = extent::value; + + CUTE_STATIC_ASSERT_V(size(rA) == Int{}); + 
CUTE_STATIC_ASSERT_V(size(rB) == Int{}); + CUTE_STATIC_ASSERT_V(size(rC) == Int{}); + + detail::explode(MMA_Op::fma, + rA, make_int_sequence{}, + rB, make_int_sequence{}, + rC, make_int_sequence{}, + &(traits.accumulate_), seq<0>{}); +} -using CLayout_64x256 = Layout,Shape < _2,_2, _32>>, - Stride,Stride<_64,_8,_512>>>; +// Accumulator layouts +template +using CLayout_64xN = Layout,Shape < _2,_2,Int>>, + Stride,Stride<_64,_8, _512>>>; + +using CLayout_64x8 = CLayout_64xN< 8>; +using CLayout_64x16 = CLayout_64xN< 16>; +using CLayout_64x32 = CLayout_64xN< 32>; +using CLayout_64x64 = CLayout_64xN< 64>; +using CLayout_64x96 = CLayout_64xN< 96>; +using CLayout_64x128 = CLayout_64xN<128>; +using CLayout_64x192 = CLayout_64xN<192>; +using CLayout_64x256 = CLayout_64xN<256>; // Register source layout for 32-bit value types using ALayout_64x8 = Layout,Shape < _2, _2>>, Stride,Stride< _8,_256>>>; -// Register source layout for 16-bit value types -using ALayout_64x16 = CLayout_64x16; +// Register source layout for 16-bit (sparse 32-bit) value types +using ALayout_64x16 = Layout,Shape < _2,_2, _2>>, + Stride,Stride<_64,_8,_512>>>; + +// Register source layout for 8-bit (sparse 16-bit) value types +using ALayout_64x32 = Layout,Shape < _4,_2, _2>>, + Stride,Stride<_64,_8,_1024>>>; -// Register source layout for 8-bit value types -using ALayout_64x32 = Layout,Shape < _4,_2, _2>>, - Stride,Stride<_64,_8,_1024>>>; +// Register source layout for sparse 8-bit value types +using ALayout_64x64 = Layout,Shape < _8,_2, _2>>, + Stride,Stride<_64,_8,_2048>>>; // Shared memory source layouts for any value type template using ABLayout = Layout,Int>>, Stride< _0,Stride< _1,Int>>>; -} // namespace GMMA +} // end namespace SM90::GMMA + +using namespace SM90; + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x16_F16F16F16_SS = SM90::GMMA::MMA_64x8x16_F16F16F16_SS; template struct MMA_Traits> @@ -471,6 +509,14 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x16_F16F16F16_RS = SM90::GMMA::MMA_64x8x16_F16F16F16_RS; + template struct MMA_Traits> { @@ -492,6 +538,14 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x16_F16F16F16_SS = SM90::GMMA::MMA_64x16x16_F16F16F16_SS; + template struct MMA_Traits> { @@ -514,6 +568,14 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x16_F16F16F16_RS = SM90::GMMA::MMA_64x16x16_F16F16F16_RS; + template struct MMA_Traits> { @@ -535,6 +597,14 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x16_F16F16F16_SS = SM90::GMMA::MMA_64x32x16_F16F16F16_SS; + 
template struct MMA_Traits> { @@ -557,6 +627,14 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x16_F16F16F16_RS = SM90::GMMA::MMA_64x32x16_F16F16F16_RS; + template struct MMA_Traits> { @@ -578,9 +656,16 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x16_F16F16F16_SS = SM90::GMMA::MMA_64x64x16_F16F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -590,21 +675,27 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_16>; + using Shape_MNK = Shape<_64,_64,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout< 48, 16>; - using CLayout = GMMA::CLayout_64x48; + using BLayout = GMMA::ABLayout< 64, 16>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x16_F16F16F16_RS = SM90::GMMA::MMA_64x64x16_F16F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -613,20 +704,27 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_16>; + using Shape_MNK = Shape<_64,_64,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout< 48, 16>; - using CLayout = GMMA::CLayout_64x48; + using BLayout = GMMA::ABLayout< 64, 16>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x16_F16F16F16_SS = SM90::GMMA::MMA_64x96x16_F16F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -636,19 +734,27 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_64,_16>; + using Shape_MNK = Shape<_64,_96,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout< 64, 16>; - using CLayout = GMMA::CLayout_64x64; + using BLayout = GMMA::ABLayout< 96, 16>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x16_F16F16F16_RS = 
SM90::GMMA::MMA_64x96x16_F16F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -657,20 +763,27 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_64,_16>; + using Shape_MNK = Shape<_64,_96,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout< 64, 16>; - using CLayout = GMMA::CLayout_64x64; + using BLayout = GMMA::ABLayout< 96, 16>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x16_F16F16F16_SS = SM90::GMMA::MMA_64x128x16_F16F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -680,21 +793,27 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_80,_16>; + using Shape_MNK = Shape<_64,_128,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout< 80, 16>; - using CLayout = GMMA::CLayout_64x80; + using BLayout = GMMA::ABLayout<128, 16>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x16_F16F16F16_RS = SM90::GMMA::MMA_64x128x16_F16F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -703,20 +822,27 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_80,_16>; + using Shape_MNK = Shape<_64,_128,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout< 80, 16>; - using CLayout = GMMA::CLayout_64x80; + using BLayout = GMMA::ABLayout<128, 16>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x16_F16F16F16_SS = SM90::GMMA::MMA_64x192x16_F16F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -726,19 +852,27 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_96,_16>; + using Shape_MNK = Shape<_64,_192,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout< 96, 16>; - using CLayout = GMMA::CLayout_64x96; + using BLayout = GMMA::ABLayout<192, 16>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x16_F16F16F16_RS = SM90::GMMA::MMA_64x192x16_F16F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -747,20 +881,27 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_96,_16>; + using Shape_MNK = Shape<_64,_192,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout< 96, 16>; - using CLayout = GMMA::CLayout_64x96; + using BLayout = GMMA::ABLayout<192, 16>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x16_F16F16F16_SS = SM90::GMMA::MMA_64x256x16_F16F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -770,21 +911,27 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_16>; + using Shape_MNK = Shape<_64,_256,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout<112, 16>; - using CLayout = GMMA::CLayout_64x112; + using BLayout = GMMA::ABLayout<256, 16>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x16_F16F16F16_RS = SM90::GMMA::MMA_64x256x16_F16F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = half_t; @@ -793,393 +940,446 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_16>; + using Shape_MNK = Shape<_64,_256,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout<112, 16>; - using CLayout = GMMA::CLayout_64x112; + using BLayout = GMMA::ABLayout<256, 16>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x16_F32F16F16_SS = SM90::GMMA::MMA_64x8x16_F32F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_16>; + using Shape_MNK = Shape<_64,_8,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout<128, 16>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout< 8, 16>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; 
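// Usage sketch (illustrative, not part of this patch): the SM90_* aliases above keep the
// pre-refactor spellings compiling now that the ops live in cute::SM90::GMMA. A TiledMMA is
// built from them as before; the 64x64x16 shape and K-major operands are placeholder choices.
//
//   auto mma = cute::make_tiled_mma(
//       cute::SM90_64x64x16_F16F16F16_SS<cute::GMMA::Major::K, cute::GMMA::Major::K>{});
//   cute::print_latex(mma);  // TikZ view of the C/A/B thread-value layouts
//   cute::print_svg(mma);    // same diagram from the new SVG printer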
//////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x16_F32F16F16_RS = SM90::GMMA::MMA_64x8x16_F32F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_16>; + using Shape_MNK = Shape<_64,_8,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout<128, 16>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout< 8, 16>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x16_F32F16F16_SS = SM90::GMMA::MMA_64x16x16_F32F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_16>; + using Shape_MNK = Shape<_64,_16,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout<144, 16>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout< 16, 16>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x16_F32F16F16_RS = SM90::GMMA::MMA_64x16x16_F32F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_16>; + using Shape_MNK = Shape<_64,_16,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout<144, 16>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout< 16, 16>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x16_F32F16F16_SS = SM90::GMMA::MMA_64x32x16_F32F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeA = 
GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_16>; + using Shape_MNK = Shape<_64,_32,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout<160, 16>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 32, 16>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x16_F32F16F16_RS = SM90::GMMA::MMA_64x32x16_F32F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_16>; + using Shape_MNK = Shape<_64,_32,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout<160, 16>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 32, 16>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x16_F32F16F16_SS = SM90::GMMA::MMA_64x64x16_F32F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_16>; + using Shape_MNK = Shape<_64,_64,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout<176, 16>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 64, 16>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x16_F32F16F16_RS = SM90::GMMA::MMA_64x64x16_F32F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_16>; + using Shape_MNK = Shape<_64,_64,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout<176, 16>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 64, 16>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + 
GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x16_F32F16F16_SS = SM90::GMMA::MMA_64x96x16_F32F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_16>; + using Shape_MNK = Shape<_64,_96,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout<192, 16>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 96, 16>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x16_F32F16F16_RS = SM90::GMMA::MMA_64x96x16_F32F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_16>; + using Shape_MNK = Shape<_64,_96,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout<192, 16>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 96, 16>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x16_F32F16F16_SS = SM90::GMMA::MMA_64x128x16_F32F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_16>; + using Shape_MNK = Shape<_64,_128,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout<208, 16>; - using CLayout = GMMA::CLayout_64x208; + using BLayout = GMMA::ABLayout<128, 16>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x16_F32F16F16_RS = SM90::GMMA::MMA_64x128x16_F32F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_16>; + using Shape_MNK = Shape<_64,_128,_16>; using ThrID = Layout<_128>; using ALayout 
= GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout<208, 16>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = half_t; - using ValTypeA = half_t; - using ValTypeB = half_t; - using ValTypeC = half_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_16>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout<224, 16>; - using CLayout = GMMA::CLayout_64x224; + using BLayout = GMMA::ABLayout<128, 16>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = half_t; - using ValTypeA = half_t; - using ValTypeB = half_t; - using ValTypeC = half_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_16>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout<224, 16>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x16_F32F16F16_SS = SM90::GMMA::MMA_64x192x16_F32F16F16_SS; -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_16>; + using Shape_MNK = Shape<_64,_192,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout<240, 16>; - using CLayout = GMMA::CLayout_64x240; + using BLayout = GMMA::ABLayout<192, 16>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x16_F32F16F16_RS = SM90::GMMA::MMA_64x192x16_F32F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_16>; + using Shape_MNK = Shape<_64,_192,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout<240, 16>; - using CLayout = GMMA::CLayout_64x240; + using BLayout = GMMA::ABLayout<192, 16>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif 
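// Usage sketch (illustrative, not part of this patch): the accumulate_ member carried by each
// traits specialization above is what warp-group mainloops toggle so that the first WGMMA of a
// k-loop overwrites the accumulators and later ones accumulate. K_BLOCK_MAX and the
// tCrA/tCrB/tCrC fragments are assumed to come from the usual TiledMMA partitioning.
//
//   tiled_mma.accumulate_ = GMMA::ScaleOut::Zero;    // first MMA writes D = A*B
//   for (int k_block = 0; k_block < K_BLOCK_MAX; ++k_block) {
//     cute::gemm(tiled_mma, tCrA(_,_,k_block), tCrB(_,_,k_block), tCrC);
//     tiled_mma.accumulate_ = GMMA::ScaleOut::One;   // subsequent MMAs do D += A*B
//   }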
//////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x16_F32F16F16_SS = SM90::GMMA::MMA_64x256x16_F32F16F16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; @@ -1195,13 +1395,21 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x16_F32F16F16_RS = SM90::GMMA::MMA_64x256x16_F32F16F16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { - using ValTypeD = half_t; + using ValTypeD = float; using ValTypeA = half_t; using ValTypeB = half_t; - using ValTypeC = half_t; + using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; @@ -1216,12 +1424,20 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x8x16_F32BF16BF16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = half_t; - using ValTypeB = half_t; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; @@ -1238,12 +1454,20 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x8x16_F32BF16BF16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = half_t; - using ValTypeB = half_t; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; @@ -1259,12 +1483,20 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x16x16_F32BF16BF16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = half_t; - using ValTypeB = half_t; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; @@ -1281,12 +1513,20 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x16x16_F32BF16BF16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = half_t; - using 
ValTypeB = half_t; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; @@ -1302,12 +1542,20 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x32x16_F32BF16BF16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = half_t; - using ValTypeB = half_t; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; @@ -1324,12 +1572,20 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x32x16_F32BF16BF16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = half_t; - using ValTypeB = half_t; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; @@ -1345,8450 +1601,801 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x64x16_F32BF16BF16_SS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = half_t; - using ValTypeB = half_t; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_16>; + using Shape_MNK = Shape<_64,_64,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 16>; - using BLayout = GMMA::ABLayout< 48, 16>; - using CLayout = GMMA::CLayout_64x48; + using BLayout = GMMA::ABLayout< 64, 16>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x64x16_F32BF16BF16_RS; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = half_t; - using ValTypeB = half_t; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_16>; + using Shape_MNK = Shape<_64,_64,_16>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x16; - using BLayout = GMMA::ABLayout< 48, 16>; - using CLayout = GMMA::CLayout_64x48; + using BLayout = GMMA::ABLayout< 64, 16>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif 
 ////////////////////////////////////////////////////////////////////////////////////////////////////

+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x96x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x96x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>;
+
 template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+struct MMA_Traits<SM90::GMMA::MMA_64x96x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = bfloat16_t;
+  using ValTypeB = bfloat16_t;
   using ValTypeC = float;

   using FrgTypeA = GMMA::smem_desc<tnspA>;
   using FrgTypeB = GMMA::smem_desc<tnspB>;

-  using Shape_MNK = Shape<_64,_64,_16>;
+  using Shape_MNK = Shape<_64,_96,_16>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 64, 16>;
-  using CLayout = GMMA::CLayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 16>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x96x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x96x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>;
+
 template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+struct MMA_Traits<SM90::GMMA::MMA_64x96x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = bfloat16_t;
+  using ValTypeB = bfloat16_t;
   using ValTypeC = float;

   using FrgTypeB = GMMA::smem_desc<tnspB>;

-  using Shape_MNK = Shape<_64,_64,_16>;
+  using Shape_MNK = Shape<_64,_96,_16>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 64, 16>;
-  using CLayout = GMMA::CLayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 16>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x128x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x128x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>;
+
 template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+struct MMA_Traits<SM90::GMMA::MMA_64x128x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = bfloat16_t;
+  using ValTypeB = bfloat16_t;
   using ValTypeC = float;

   using FrgTypeA = GMMA::smem_desc<tnspA>;
   using FrgTypeB = GMMA::smem_desc<tnspB>;

-  using Shape_MNK = Shape<_64,_80,_16>;
+  using Shape_MNK = Shape<_64,_128,_16>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 80, 16>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout<128, 16>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x128x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x128x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>;
+
 template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+struct MMA_Traits<SM90::GMMA::MMA_64x128x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = bfloat16_t;
+  using ValTypeB = bfloat16_t;
   using ValTypeC = float;
   using FrgTypeB = GMMA::smem_desc<tnspB>;

-  using Shape_MNK = Shape<_64,_80,_16>;
+  using Shape_MNK = Shape<_64,_128,_16>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 80, 16>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout<128, 16>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x192x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x192x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>;
+
 template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+struct MMA_Traits<SM90::GMMA::MMA_64x192x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = bfloat16_t;
+  using ValTypeB = bfloat16_t;
   using ValTypeC = float;

   using FrgTypeA = GMMA::smem_desc<tnspA>;
   using FrgTypeB = GMMA::smem_desc<tnspB>;

-  using Shape_MNK = Shape<_64,_96,_16>;
+  using Shape_MNK = Shape<_64,_192,_16>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 96, 16>;
-  using CLayout = GMMA::CLayout_64x96;
+  using BLayout = GMMA::ABLayout<192, 16>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x192x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x192x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>;
+
 template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+struct MMA_Traits<SM90::GMMA::MMA_64x192x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = bfloat16_t;
+  using ValTypeB = bfloat16_t;
   using ValTypeC = float;

   using FrgTypeB = GMMA::smem_desc<tnspB>;

-  using Shape_MNK = Shape<_64,_96,_16>;
+  using Shape_MNK = Shape<_64,_192,_16>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 96, 16>;
-  using CLayout = GMMA::CLayout_64x96;
+  using BLayout = GMMA::ABLayout<192, 16>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x256x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x256x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>;
+
 template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+struct MMA_Traits<SM90::GMMA::MMA_64x256x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = bfloat16_t;
+  using ValTypeB = bfloat16_t;
   using ValTypeC = float;

   using FrgTypeA = GMMA::smem_desc<tnspA>;
   using FrgTypeB = GMMA::smem_desc<tnspB>;

-  using Shape_MNK = Shape<_64,_112,_16>;
+  using Shape_MNK = Shape<_64,_256,_16>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<112, 16>;
-  using CLayout = GMMA::CLayout_64x112;
+  using BLayout = GMMA::ABLayout<256, 16>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x256x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x256x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>;
+
 template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+struct MMA_Traits<SM90::GMMA::MMA_64x256x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = bfloat16_t;
+  using ValTypeB = bfloat16_t;
   using ValTypeC = float;

   using FrgTypeB = GMMA::smem_desc<tnspB>;

-  using Shape_MNK = Shape<_64,_112,_16>;
+  using Shape_MNK = Shape<_64,_256,_16>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<112, 16>;
-  using CLayout = GMMA::CLayout_64x112;
+  using BLayout = GMMA::ABLayout<256, 16>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x8x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x8x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x8x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_128,_16>;
+  using Shape_MNK = Shape<_64,_8,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<128, 16>;
-  using CLayout = GMMA::CLayout_64x128;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout< 8, 8>;
+  using CLayout = GMMA::CLayout_64x8;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x8x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x8x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x8x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_128,_16>;
+  using Shape_MNK = Shape<_64,_8,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<128, 16>;
-  using CLayout = GMMA::CLayout_64x128;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout< 8, 8>;
+  using CLayout = GMMA::CLayout_64x8;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x16x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x16x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x16x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_144,_16>;
+  using Shape_MNK = Shape<_64,_16,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<144, 16>;
-  using CLayout = GMMA::CLayout_64x144;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout< 16, 8>;
+  using CLayout = GMMA::CLayout_64x16;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x16x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x16x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x16x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_144,_16>;
+  using Shape_MNK = Shape<_64,_16,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<144, 16>;
-  using CLayout = GMMA::CLayout_64x144;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout< 16, 8>;
+  using CLayout = GMMA::CLayout_64x16;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x32x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x32x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x32x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_160,_16>;
+  using Shape_MNK = Shape<_64,_32,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<160, 16>;
-  using CLayout = GMMA::CLayout_64x160;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout< 32, 8>;
+  using CLayout = GMMA::CLayout_64x32;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x32x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x32x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x32x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_160,_16>;
+  using Shape_MNK = Shape<_64,_32,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<160, 16>;
-  using CLayout = GMMA::CLayout_64x160;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout< 32, 8>;
+  using CLayout = GMMA::CLayout_64x32;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x64x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x64x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x64x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_176,_16>;
+  using Shape_MNK = Shape<_64,_64,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<176, 16>;
-  using CLayout = GMMA::CLayout_64x176;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout< 64, 8>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x64x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x64x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x64x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_176,_16>;
+  using Shape_MNK = Shape<_64,_64,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<176, 16>;
-  using CLayout = GMMA::CLayout_64x176;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout< 64, 8>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x96x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x96x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x96x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_192,_16>;
+  using Shape_MNK = Shape<_64,_96,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<192, 16>;
-  using CLayout = GMMA::CLayout_64x192;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout< 96, 8>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x96x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x96x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x96x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_192,_16>;
+  using Shape_MNK = Shape<_64,_96,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<192, 16>;
-  using CLayout = GMMA::CLayout_64x192;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout< 96, 8>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x128x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x128x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x128x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_208,_16>;
+  using Shape_MNK = Shape<_64,_128,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<208, 16>;
-  using CLayout = GMMA::CLayout_64x208;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<128, 8>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x128x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x128x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x128x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_208,_16>;
+  using Shape_MNK = Shape<_64,_128,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<208, 16>;
-  using CLayout = GMMA::CLayout_64x208;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<128, 8>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x192x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x192x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x192x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_224,_16>;
+  using Shape_MNK = Shape<_64,_192,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<224, 16>;
-  using CLayout = GMMA::CLayout_64x224;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<192, 8>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x192x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x192x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x192x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
   using ValTypeC = float;

-  using FrgTypeB = GMMA::smem_desc<tnspB>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_224,_16>;
+  using Shape_MNK = Shape<_64,_192,_8>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<224, 16>;
-  using CLayout = GMMA::CLayout_64x224;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<192, 8>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x256x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x256x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::MMA_64x256x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_240,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<240, 16>;
-  using CLayout = GMMA::CLayout_64x240;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_240,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<240, 16>;
-  using CLayout = GMMA::CLayout_64x240;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_256,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<256, 16>;
-  using CLayout = GMMA::CLayout_64x256;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = half_t;
-  using ValTypeB = half_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_256,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<256, 16>;
-  using CLayout = GMMA::CLayout_64x256;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_8,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 8, 16>;
-  using CLayout = GMMA::CLayout_64x8;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_8,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 8, 16>;
-  using CLayout = GMMA::CLayout_64x8;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_16,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 16, 16>;
-  using CLayout = GMMA::CLayout_64x16;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_16,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 16, 16>;
-  using CLayout = GMMA::CLayout_64x16;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_32,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 32, 16>;
-  using CLayout = GMMA::CLayout_64x32;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_32,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 32, 16>;
-  using CLayout = GMMA::CLayout_64x32;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_48,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 48, 16>;
-  using CLayout = GMMA::CLayout_64x48;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_48,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 48, 16>;
-  using CLayout = GMMA::CLayout_64x48;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_64,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 64, 16>;
-  using CLayout = GMMA::CLayout_64x64;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_64,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 64, 16>;
-  using CLayout = GMMA::CLayout_64x64;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_80,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 80, 16>;
-  using CLayout = GMMA::CLayout_64x80;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_80,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 80, 16>;
-  using CLayout = GMMA::CLayout_64x80;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_96,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout< 96, 16>;
-  using CLayout = GMMA::CLayout_64x96;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_96,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout< 96, 16>;
-  using CLayout = GMMA::CLayout_64x96;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_112,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<112, 16>;
-  using CLayout = GMMA::CLayout_64x112;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_112,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<112, 16>;
-  using CLayout = GMMA::CLayout_64x112;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_128,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<128, 16>;
-  using CLayout = GMMA::CLayout_64x128;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_128,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<128, 16>;
-  using CLayout = GMMA::CLayout_64x128;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_144,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<144, 16>;
-  using CLayout = GMMA::CLayout_64x144;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_144,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<144, 16>;
-  using CLayout = GMMA::CLayout_64x144;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_160,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<160, 16>;
-  using CLayout = GMMA::CLayout_64x160;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_160,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<160, 16>;
-  using CLayout = GMMA::CLayout_64x160;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_176,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<176, 16>;
-  using CLayout = GMMA::CLayout_64x176;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_176,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<176, 16>;
-  using CLayout = GMMA::CLayout_64x176;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_192,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<192, 16>;
-  using CLayout = GMMA::CLayout_64x192;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_192,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<192, 16>;
-  using CLayout = GMMA::CLayout_64x192;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_208,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<208, 16>;
-  using CLayout = GMMA::CLayout_64x208;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_208,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<208, 16>;
-  using CLayout = GMMA::CLayout_64x208;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_224,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<224, 16>;
-  using CLayout = GMMA::CLayout_64x224;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_224,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<224, 16>;
-  using CLayout = GMMA::CLayout_64x224;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_240,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<240, 16>;
-  using CLayout = GMMA::CLayout_64x240;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_240,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<240, 16>;
-  using CLayout = GMMA::CLayout_64x240;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x16_F32BF16BF16_SS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<tnspA>;
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_256,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 16>;
-  using BLayout = GMMA::ABLayout<256, 16>;
-  using CLayout = GMMA::CLayout_64x256;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x16_F32BF16BF16_RS<tnspA, tnspB, scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = bfloat16_t;
-  using ValTypeB = bfloat16_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<tnspB>;
-
-  using Shape_MNK = Shape<_64,_256,_16>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x16;
-  using BLayout = GMMA::ABLayout<256, 16>;
-  using CLayout = GMMA::CLayout_64x256;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_8,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout< 8, 8>;
-  using CLayout = GMMA::CLayout_64x8;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_8,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout< 8, 8>;
-  using CLayout = GMMA::CLayout_64x8;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_16,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout< 16, 8>;
-  using CLayout = GMMA::CLayout_64x16;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_16,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout< 16, 8>;
-  using CLayout = GMMA::CLayout_64x16;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_32,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout< 32, 8>;
-  using CLayout = GMMA::CLayout_64x32;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_32,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout< 32, 8>;
-  using CLayout = GMMA::CLayout_64x32;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_48,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout< 48, 8>;
-  using CLayout = GMMA::CLayout_64x48;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_48,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout< 48, 8>;
-  using CLayout = GMMA::CLayout_64x48;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_64,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout< 64, 8>;
-  using CLayout = GMMA::CLayout_64x64;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_64,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout< 64, 8>;
-  using CLayout = GMMA::CLayout_64x64;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_80,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout< 80, 8>;
-  using CLayout = GMMA::CLayout_64x80;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_80,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout< 80, 8>;
-  using CLayout = GMMA::CLayout_64x80;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_96,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout< 96, 8>;
-  using CLayout = GMMA::CLayout_64x96;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_96,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout< 96, 8>;
-  using CLayout = GMMA::CLayout_64x96;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_112,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout<112, 8>;
-  using CLayout = GMMA::CLayout_64x112;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_112,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout<112, 8>;
-  using CLayout = GMMA::CLayout_64x112;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_128,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout<128, 8>;
-  using CLayout = GMMA::CLayout_64x128;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_128,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout<128, 8>;
-  using CLayout = GMMA::CLayout_64x128;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_144,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout<144, 8>;
-  using CLayout = GMMA::CLayout_64x144;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_144,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout<144, 8>;
-  using CLayout = GMMA::CLayout_64x144;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_160,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout<160, 8>;
-  using CLayout = GMMA::CLayout_64x160;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_160,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout<160, 8>;
-  using CLayout = GMMA::CLayout_64x160;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_176,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 8>;
-  using BLayout = GMMA::ABLayout<176, 8>;
-  using CLayout = GMMA::CLayout_64x176;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_176,_8>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x8;
-  using BLayout = GMMA::ABLayout<176, 8>;
-  using CLayout = GMMA::CLayout_64x176;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = tfloat32_t;
-  using ValTypeB = tfloat32_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_192,_8>;
-  using ThrID = Layout<_128>;
GMMA::ABLayout< 64, 8>; - using BLayout = GMMA::ABLayout<192, 8>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = tfloat32_t; - using ValTypeB = tfloat32_t; - using ValTypeC = float; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_8>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x8; - using BLayout = GMMA::ABLayout<192, 8>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = tfloat32_t; - using ValTypeB = tfloat32_t; - using ValTypeC = float; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_8>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 8>; - using BLayout = GMMA::ABLayout<208, 8>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = tfloat32_t; - using ValTypeB = tfloat32_t; - using ValTypeC = float; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_8>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x8; - using BLayout = GMMA::ABLayout<208, 8>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = tfloat32_t; - using ValTypeB = tfloat32_t; - using ValTypeC = float; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_8>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 8>; - using BLayout = GMMA::ABLayout<224, 8>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = tfloat32_t; - using ValTypeB = tfloat32_t; - using ValTypeC = float; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_8>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x8; - using BLayout = GMMA::ABLayout<224, 8>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = tfloat32_t; - using ValTypeB = tfloat32_t; - using ValTypeC = float; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = 
GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_8>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 8>; - using BLayout = GMMA::ABLayout<240, 8>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = tfloat32_t; - using ValTypeB = tfloat32_t; - using ValTypeC = float; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_8>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x8; - using BLayout = GMMA::ABLayout<240, 8>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = tfloat32_t; - using ValTypeB = tfloat32_t; - using ValTypeC = float; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_8>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 8>; - using BLayout = GMMA::ABLayout<256, 8>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = tfloat32_t; - using ValTypeB = tfloat32_t; - using ValTypeC = float; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_8>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x8; - using BLayout = GMMA::ABLayout<256, 8>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - 
using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = 
Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = 
int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = 
GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB 
= int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; 
- using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = 
GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = 
Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// 
- -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = 
GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using 
ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct 
MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = 
int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using 
ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using 
ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = int8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = 
Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - 
using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_96,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct 
MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_112,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_128,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_144,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; 
-#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_160,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_176,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_192,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; 
- using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_208,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_224,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_240,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - 
using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_16,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using 
ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_32,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_48,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = int8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut 
accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <>
-struct MMA_Traits<SM90_64x80x32_S32U8S8_RS_TN_SATURATE>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = int8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_80,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <>
-struct MMA_Traits<SM90_64x96x32_S32U8S8_RS_TN>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = int8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_96,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <>
-struct MMA_Traits<SM90_64x96x32_S32U8S8_RS_TN_SATURATE>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = int8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_96,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <>
-struct MMA_Traits<SM90_64x256x32_S32U8S8_RS_TN>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = int8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_256,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <>
-struct MMA_Traits<SM90_64x256x32_S32U8S8_RS_TN_SATURATE>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = int8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_256,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
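All of the removed specializations above share one skeleton: Shape_MNK is the atom's M-by-N-by-K tile, ThrID is the 128-thread warpgroup, and ALayout/BLayout/CLayout map (thread, value) coordinates into the A, B, and accumulator tiles. A minimal usage sketch, assuming the standard CuTe headers and spellings (cute/tensor.hpp, cute/atom/mma_atom.hpp; the atom name is one whose traits are defined in this file):

  #include <cute/tensor.hpp>
  #include <cute/atom/mma_atom.hpp>
  using namespace cute;

  // Build a warpgroup-wide TiledMMA from a single GMMA atom; the MMA_Traits
  // specialization supplies the 64x96x32 tile and the per-thread partitioning
  // of A, B, and C. size(tiled_mma) is 128 threads, from ThrID = Layout<_128>.
  auto tiled_mma = make_tiled_mma(SM90_64x96x32_S32U8S8_RS_TN{});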
-template <>
-struct MMA_Traits<SM90_64x8x32_S32U8U8_SS_TN>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = uint8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_8,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<  8, 32>;
-  using CLayout = GMMA::CLayout_64x8;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <>
-struct MMA_Traits<SM90_64x8x32_S32U8U8_SS_TN_SATURATE>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = uint8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_8,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<  8, 32>;
-  using CLayout = GMMA::CLayout_64x8;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <>
-struct MMA_Traits<SM90_64x256x32_S32U8U8_SS_TN>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = uint8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_256,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <>
-struct MMA_Traits<SM90_64x256x32_S32U8U8_SS_TN_SATURATE>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = uint8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_256,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
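In the atom names above, the SS/RS suffix records where GMMA sources its operands: SS atoms read both A and B from shared memory through matrix descriptors (both FrgTypeA and FrgTypeB are GMMA::smem_desc), while RS atoms keep A in registers and define no FrgTypeA. A hedged compile-time check of that difference, assuming this header and C++17 <type_traits>:

  #include <type_traits>
  #include <cute/atom/mma_traits_sm90_gmma.hpp>
  using namespace cute;

  // SS: operand A is described by a shared-memory descriptor.
  static_assert(std::is_same_v<
      typename MMA_Traits<SM90_64x64x32_S32U8U8_SS_TN>::FrgTypeA,
      GMMA::smem_desc<GMMA::Major::K>>);
  // RS: MMA_Traits<SM90_64x64x32_S32U8U8_RS_TN> defines FrgTypeB only,
  // so CuTe partitions A as an ordinary register-backed tensor.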
-template <>
-struct MMA_Traits<SM90_64x8x32_S32U8U8_RS_TN>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = uint8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_8,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<  8, 32>;
-  using CLayout = GMMA::CLayout_64x8;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-template <>
-struct MMA_Traits<SM90_64x8x32_S32U8U8_RS_TN_SATURATE>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = uint8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_8,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<  8, 32>;
-  using CLayout = GMMA::CLayout_64x8;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <>
-struct MMA_Traits<SM90_64x240x32_S32U8U8_RS_TN>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = uint8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_240,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <>
-struct MMA_Traits<SM90_64x240x32_S32U8U8_RS_TN_SATURATE>
-{
-  using ValTypeD = int32_t;
-  using ValTypeA = uint8_t;
-  using ValTypeB = uint8_t;
-  using ValTypeC = int32_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_240,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
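The _SATURATE variants differ only in how the integer result is written back: instead of wrapping on overflow, the value is clamped to the int32_t accumulator range. A scalar analogue of the intended semantics (illustrative only; the real clamping happens inside the wgmma instruction):

  #include <cstdint>
  #include <algorithm>

  // Clamp a widened dot-product sum into the 32-bit accumulator range.
  int32_t saturate_to_s32(int64_t wide_sum) {
    return static_cast<int32_t>(std::clamp<int64_t>(wide_sum, INT32_MIN, INT32_MAX));
  }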
ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template <> -struct MMA_Traits -{ - using ValTypeD = int32_t; - using ValTypeA = uint8_t; - using ValTypeB = uint8_t; - using ValTypeC = int32_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_256,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template -struct MMA_Traits> -{ - using ValTypeD = half_t; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = half_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template -struct MMA_Traits> -{ - using ValTypeD = half_t; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = half_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_8,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_8,_32>; + using Shape_MNK = Shape<_64,_256,_8>; using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout<256, 8>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x256x8_F32TF32TF32_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_8,_32>; + using Shape_MNK = Shape<_64,_256,_8>; using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using 
CLayout = GMMA::CLayout_64x8; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout<256, 8>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -template -struct MMA_Traits> + +using SM90_64x8x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x8x32_S32S8S8_SS_TN; + +template <> +struct MMA_Traits { - using ValTypeD = half_t; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = half_t; + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_16,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -template -struct MMA_Traits> + +using SM90_64x8x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x8x32_S32S8S8_SS_TN_SATURATE; + +template <> +struct MMA_Traits { - using ValTypeD = half_t; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = half_t; + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_16,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -template -struct MMA_Traits> + +using SM90_64x16x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x16x32_S32S8S8_SS_TN; + +template <> +struct MMA_Traits { - using ValTypeD = float; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = float; + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; @@ -9804,19 +2411,23 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// -template -struct MMA_Traits> + +using SM90_64x16x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x16x32_S32S8S8_SS_TN_SATURATE; + +template <> +struct MMA_Traits { - using ValTypeD = float; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = float; + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; + using ALayout = GMMA::ABLayout< 64, 32>; using BLayout = GMMA::ABLayout< 16, 32>; using CLayout = GMMA::CLayout_64x16; @@ -9825,13 +2436,16 @@ struct MMA_Traits> 
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x32x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32S8S8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -9847,34 +2461,16 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_32,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 32, 32>;
-  using CLayout = GMMA::CLayout_64x32;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-////////////////////////////////////////////////////////////////////////////////////////////////////
+using SM90_64x32x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x32x32_S32S8S8_SS_TN_SATURATE;

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+template <>
+struct MMA_Traits<SM90_64x32x32_S32S8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -9890,495 +2486,512 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_32,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 32, 32>;
-  using CLayout = GMMA::CLayout_64x32;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-////////////////////////////////////////////////////////////////////////////////////////////////////
+using SM90_64x64x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x64x32_S32S8S8_SS_TN;

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+template <>
+struct MMA_Traits<SM90_64x64x32_S32S8S8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_48,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x64x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32S8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_48,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x96x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32S8S8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_48,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x96x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32S8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_48,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x128x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32S8S8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_64,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 64, 32>;
-  using CLayout = GMMA::CLayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x128x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32S8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_64,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 64, 32>;
-  using CLayout = GMMA::CLayout_64x64;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x192x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32S8S8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_64,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 64, 32>;
-  using CLayout = GMMA::CLayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x192x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32S8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_64,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 64, 32>;
-  using CLayout = GMMA::CLayout_64x64;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x256x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32S8S8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x256x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32S8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x8x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32S8S8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<  8, 32>;
+  using CLayout = GMMA::CLayout_64x8;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x8x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32S8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout<  8, 32>;
+  using CLayout = GMMA::CLayout_64x8;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////
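For illustration only (not part of the patch): a minimal host-side sketch of how the new non-templated int8 atoms can be exercised. It assumes CUTLASS >= 3.6 headers on the include path and C++17; the op name is now a plain type, so MMA_Traits is a full specialization and no GMMA::ScaleIn template arguments are needed.

// probe_int8_atom.cpp -- compile-time probe of one of the new specializations.
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

int main() {
  using namespace cute;
  // Build a single-warpgroup TiledMMA from the 64x64x32 int8 shared/shared atom.
  auto tiled_mma = make_tiled_mma(SM90_64x64x32_S32S8S8_SS_TN{});
  (void)tiled_mma;  // host-side only; on device it would be sliced into ThrMMA partitions
  // Inspect the static trait members added in this change.
  print(typename MMA_Traits<SM90_64x64x32_S32S8S8_SS_TN>::Shape_MNK{});  // (_64,_64,_32)
  print("\n");
  print(typename MMA_Traits<SM90_64x64x32_S32S8S8_SS_TN>::ThrID{});     // 128 threads: one warpgroup
  print("\n");
  return 0;
}

The device-side GMMA instruction itself still requires an SM90 target; the probe above only touches the compile-time traits.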
+
+using SM90_64x16x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x16x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32S8S8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_96,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x16x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32S8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_96,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x32x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32S8S8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_96,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x32x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32S8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_96,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x64x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32S8S8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_112,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<112, 32>;
-  using CLayout = GMMA::CLayout_64x112;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+
+using SM90_64x64x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x64x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_112,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<112, 32>;
-  using CLayout = GMMA::CLayout_64x112;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x96x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32S8S8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_112,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<112, 32>;
-  using CLayout = GMMA::CLayout_64x112;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x96x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32S8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_112,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<112, 32>;
-  using CLayout = GMMA::CLayout_64x112;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x128x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32S8S8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

   using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ALayout = GMMA::ALayout_64x32;
   using BLayout = GMMA::ABLayout<128, 32>;
   using CLayout = GMMA::CLayout_64x128;
@@ -10387,13 +3000,16 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x128x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32S8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -10408,381 +3024,412 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x192x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32S8S8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_128,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<128, 32>;
-  using CLayout = GMMA::CLayout_64x128;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x192x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32S8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_128,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<128, 32>;
-  using CLayout = GMMA::CLayout_64x128;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x256x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32S8S8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_144,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<144, 32>;
-  using CLayout = GMMA::CLayout_64x144;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x256x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32S8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_144,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<144, 32>;
-  using CLayout = GMMA::CLayout_64x144;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x8x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32S8U8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_144,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<144, 32>;
-  using CLayout = GMMA::CLayout_64x144;
+  using BLayout = GMMA::ABLayout<  8, 32>;
+  using CLayout = GMMA::CLayout_64x8;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x8x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32S8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_144,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<144, 32>;
-  using CLayout = GMMA::CLayout_64x144;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<  8, 32>;
+  using CLayout = GMMA::CLayout_64x8;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x16x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32S8U8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_160,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<160, 32>;
-  using CLayout = GMMA::CLayout_64x160;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x16x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32S8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_160,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<160, 32>;
-  using CLayout = GMMA::CLayout_64x160;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x32x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32S8U8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_160,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<160, 32>;
-  using CLayout = GMMA::CLayout_64x160;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x32x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32S8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_160,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<160, 32>;
-  using CLayout = GMMA::CLayout_64x160;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x64x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32S8U8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_176,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<176, 32>;
-  using CLayout = GMMA::CLayout_64x176;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x64x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32S8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_176,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<176, 32>;
-  using CLayout = GMMA::CLayout_64x176;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x96x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32S8U8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_176,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<176, 32>;
-  using CLayout = GMMA::CLayout_64x176;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x96x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32S8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_176,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<176, 32>;
-  using CLayout = GMMA::CLayout_64x176;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x128x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32S8U8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_192,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<192, 32>;
-  using CLayout = GMMA::CLayout_64x192;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x128x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32S8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_192,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<192, 32>;
-  using CLayout = GMMA::CLayout_64x192;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x192x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32S8U8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -10798,19 +3445,23 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x192x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32S8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

   using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
+  using ALayout = GMMA::ABLayout< 64, 32>;
   using BLayout = GMMA::ABLayout<192, 32>;
   using CLayout = GMMA::CLayout_64x192;
@@ -10819,424 +3470,450 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x256x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32S8U8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_208,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<208, 32>;
-  using CLayout = GMMA::CLayout_64x208;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x256x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32S8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_208,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<208, 32>;
-  using CLayout = GMMA::CLayout_64x208;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x8x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32S8U8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_208,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<208, 32>;
-  using CLayout = GMMA::CLayout_64x208;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<  8, 32>;
+  using CLayout = GMMA::CLayout_64x8;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x8x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32S8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_208,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<208, 32>;
-  using CLayout = GMMA::CLayout_64x208;
+  using BLayout = GMMA::ABLayout<  8, 32>;
+  using CLayout = GMMA::CLayout_64x8;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
-////////////////////////////////////////////////////////////////////////////////////////////////////
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x16x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x16x32_S32S8U8_RS_TN;

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+template <>
+struct MMA_Traits<SM90_64x16x32_S32S8U8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_224,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<224, 32>;
-  using CLayout = GMMA::CLayout_64x224;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x16x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32S8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_224,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<224, 32>;
-  using CLayout = GMMA::CLayout_64x224;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x32x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32S8U8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_224,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<224, 32>;
-  using CLayout = GMMA::CLayout_64x224;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x32x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32S8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_224,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<224, 32>;
-  using CLayout = GMMA::CLayout_64x224;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x64x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32S8U8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_240,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x64x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32S8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_240,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x96x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32S8U8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_240,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x96x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32S8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_240,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x128x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32S8U8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_256,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
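As a compile-time contrast of the SS and RS flavors added above (a sketch, not part of the patch; that the descriptors are K-major is an assumption consistent with these TN atoms):

// SS atoms source both A and B from shared memory via GMMA descriptors, so
// FrgTypeA is a smem_desc; RS atoms hold A in registers, so only FrgTypeB is.
#include <type_traits>
#include <cute/atom/mma_traits_sm90_gmma.hpp>

using namespace cute;

static_assert(std::is_same_v<MMA_Traits<SM90_64x32x32_S32S8U8_SS_TN>::FrgTypeA,
                             GMMA::smem_desc<GMMA::Major::K>>,
              "SS: operand A comes through a shared-memory descriptor");
// The SS and RS variants of the same op compute the same 64x32x32 tile;
// only the A-operand path differs.
static_assert(std::is_same_v<MMA_Traits<SM90_64x32x32_S32S8U8_SS_TN>::Shape_MNK,
                             MMA_Traits<SM90_64x32x32_S32S8U8_RS_TN>::Shape_MNK>);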
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x128x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32S8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_256,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x192x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32S8U8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_256,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x192x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32S8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_256,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x256x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32S8U8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_8,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<  8, 32>;
-  using CLayout = GMMA::CLayout_64x8;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x256x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32S8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;

   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_8,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<  8, 32>;
-  using CLayout = GMMA::CLayout_64x8;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x8x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32U8S8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -11252,19 +3929,23 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x8x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32U8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

   using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
+  using ALayout = GMMA::ABLayout< 64, 32>;
   using BLayout = GMMA::ABLayout<  8, 32>;
   using CLayout = GMMA::CLayout_64x8;
@@ -11273,13 +3954,16 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x16x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32U8S8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -11295,34 +3979,16 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_16,_32>;
-  using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 16, 32>;
-  using CLayout = GMMA::CLayout_64x16;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-////////////////////////////////////////////////////////////////////////////////////////////////////
+using SM90_64x16x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x16x32_S32U8S8_SS_TN_SATURATE;

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+template <>
+struct MMA_Traits<SM90_64x16x32_S32U8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -11338,34 +4004,41 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x32x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32U8S8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_16,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 16, 32>;
-  using CLayout = GMMA::CLayout_64x16;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x32x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32U8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -11381,416 +4054,458 @@ struct MMA_Traits
 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x64x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32U8S8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;

+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;

-  using Shape_MNK = Shape<_64,_32,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID   = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 32, 32>;
-  using CLayout = GMMA::CLayout_64x32;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;

   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////

-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x64x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32U8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
= int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_32,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -template -struct MMA_Traits> + +using SM90_64x96x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x96x32_S32U8S8_SS_TN; + +template <> +struct MMA_Traits { - using ValTypeD = float; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e5m2_t; - using ValTypeC = float; + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_32,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> + +using SM90_64x96x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x96x32_S32U8S8_SS_TN_SATURATE; + +template <> +struct MMA_Traits { - using ValTypeD = half_t; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e5m2_t; - using ValTypeC = half_t; + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> + +using SM90_64x128x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x128x32_S32U8S8_SS_TN; + +template <> +struct MMA_Traits { - using ValTypeD = half_t; - using ValTypeA = float_e4m3_t; - using ValTypeB = float_e5m2_t; - using ValTypeC = half_t; + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x128x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32U8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_48,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x192x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32U8S8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_48,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x192x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32U8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_64,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 64, 32>;
-  using CLayout = GMMA::CLayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x256x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32U8S8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_64,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 64, 32>;
-  using CLayout = GMMA::CLayout_64x64;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x256x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32U8S8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_64,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 64, 32>;
-  using CLayout = GMMA::CLayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x8x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32U8S8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_64,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 64, 32>;
-  using CLayout = GMMA::CLayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 32>;
+  using CLayout = GMMA::CLayout_64x8;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x8x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32U8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 8, 32>;
+  using CLayout = GMMA::CLayout_64x8;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x16x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32U8S8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x16x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32U8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x32x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32U8S8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x32x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32U8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_96,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x64x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32U8S8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_96,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x64x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32U8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_96,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x96x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32U8S8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -11805,475 +4520,484 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_112,_32>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<112, 32>;
-  using CLayout = GMMA::CLayout_64x112;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-////////////////////////////////////////////////////////////////////////////////////////////////////
+using SM90_64x96x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x96x32_S32U8S8_RS_TN_SATURATE;
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+template <>
+struct MMA_Traits<SM90_64x96x32_S32U8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_112,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<112, 32>;
-  using CLayout = GMMA::CLayout_64x112;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x128x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32U8S8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_112,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<112, 32>;
-  using CLayout = GMMA::CLayout_64x112;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x112x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x128x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32U8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_112,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<112, 32>;
-  using CLayout = GMMA::CLayout_64x112;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x192x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32U8S8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_128,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<128, 32>;
-  using CLayout = GMMA::CLayout_64x128;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x192x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32U8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_128,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<128, 32>;
-  using CLayout = GMMA::CLayout_64x128;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x256x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32U8S8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_128,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<128, 32>;
-  using CLayout = GMMA::CLayout_64x128;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x128x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x256x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32U8S8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_128,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<128, 32>;
-  using CLayout = GMMA::CLayout_64x128;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x8x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32U8U8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_144,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<144, 32>;
-  using CLayout = GMMA::CLayout_64x144;
+  using BLayout = GMMA::ABLayout< 8, 32>;
+  using CLayout = GMMA::CLayout_64x8;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x8x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32U8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_144,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<144, 32>;
-  using CLayout = GMMA::CLayout_64x144;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 8, 32>;
+  using CLayout = GMMA::CLayout_64x8;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x16x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32U8U8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_144,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<144, 32>;
-  using CLayout = GMMA::CLayout_64x144;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x144x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x16x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32U8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_144,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<144, 32>;
-  using CLayout = GMMA::CLayout_64x144;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x32x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32U8U8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_160,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<160, 32>;
-  using CLayout = GMMA::CLayout_64x160;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x32x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32U8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_160,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<160, 32>;
-  using CLayout = GMMA::CLayout_64x160;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x64x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32U8U8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_160,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<160, 32>;
-  using CLayout = GMMA::CLayout_64x160;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x160x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x64x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32U8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_160,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<160, 32>;
-  using CLayout = GMMA::CLayout_64x160;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x96x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32U8U8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_176,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<176, 32>;
-  using CLayout = GMMA::CLayout_64x176;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x96x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32U8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_176,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<176, 32>;
-  using CLayout = GMMA::CLayout_64x176;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x128x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32U8U8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_176,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<176, 32>;
-  using CLayout = GMMA::CLayout_64x176;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x176x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x128x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32U8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_176,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<176, 32>;
-  using CLayout = GMMA::CLayout_64x176;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x192x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32U8U8_SS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -12289,19 +5013,23 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x192x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32U8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
   using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
+  using ALayout = GMMA::ABLayout< 64, 32>;
   using BLayout = GMMA::ABLayout<192, 32>;
   using CLayout = GMMA::CLayout_64x192;
@@ -12310,388 +5038,408 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x256x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32U8U8_SS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_192,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<192, 32>;
-  using CLayout = GMMA::CLayout_64x192;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x192x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x256x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32U8U8_SS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_192,_32>;
+  using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<192, 32>;
-  using CLayout = GMMA::CLayout_64x192;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<256, 32>;
+  using CLayout = GMMA::CLayout_64x256;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x8x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32U8U8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_208,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<208, 32>;
-  using CLayout = GMMA::CLayout_64x208;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 8, 32>;
+  using CLayout = GMMA::CLayout_64x8;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x8x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x8x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x8x32_S32U8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_208,_32>;
+  using Shape_MNK = Shape<_64,_8,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<208, 32>;
-  using CLayout = GMMA::CLayout_64x208;
+  using BLayout = GMMA::ABLayout< 8, 32>;
+  using CLayout = GMMA::CLayout_64x8;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x16x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32U8U8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_208,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<208, 32>;
-  using CLayout = GMMA::CLayout_64x208;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x208x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x16x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x16x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x16x32_S32U8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_208,_32>;
+  using Shape_MNK = Shape<_64,_16,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<208, 32>;
-  using CLayout = GMMA::CLayout_64x208;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x32x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32U8U8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_224,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<224, 32>;
-  using CLayout = GMMA::CLayout_64x224;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x32x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x32x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x32x32_S32U8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_224,_32>;
+  using Shape_MNK = Shape<_64,_32,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<224, 32>;
-  using CLayout = GMMA::CLayout_64x224;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x64x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32U8U8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_224,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<224, 32>;
-  using CLayout = GMMA::CLayout_64x224;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x224x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x64x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x64x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x64x32_S32U8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_224,_32>;
+  using Shape_MNK = Shape<_64,_64,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<224, 32>;
-  using CLayout = GMMA::CLayout_64x224;
+  using BLayout = GMMA::ABLayout< 64, 32>;
+  using CLayout = GMMA::CLayout_64x64;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x96x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32U8U8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_240,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x96x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x96x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x96x32_S32U8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_240,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x128x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32U8U8_RS_TN>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_240,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x240x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x128x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x128x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x128x32_S32U8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_240,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x192x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32U8U8_RS_TN>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_256,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+
+using SM90_64x192x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x192x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x192x32_S32U8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = half_t;
-  using ValTypeA = float_e4m3_t;
-  using ValTypeB = float_e5m2_t;
-  using ValTypeC = half_t;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-  using Shape_MNK = Shape<_64,_256,_32>;
+  using Shape_MNK = Shape<_64,_192,_32>;
   using ThrID = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<256, 32>;
-  using CLayout = GMMA::CLayout_64x256;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+using SM90_64x256x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x256x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
   using Shape_MNK = Shape<_64,_256,_32>;
   using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ALayout = GMMA::ALayout_64x32;
   using BLayout = GMMA::ABLayout<256, 32>;
   using CLayout = GMMA::CLayout_64x256;
@@ -12700,13 +5448,16 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x256x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+
+using SM90_64x256x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x256x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x256x32_S32U8U8_RS_TN_SATURATE>
 {
-  using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -12721,11 +5472,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x8x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x8x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x8x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
@@ -12743,11 +5500,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x8x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x8x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x8x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
@@ -12764,11 +5527,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x8x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x8x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x8x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
@@ -12786,11 +5555,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x8x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x8x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x8x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x8x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
@@ -12807,11 +5582,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x16x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x16x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x16x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
@@ -12829,11 +5610,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x16x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x16x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x16x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
@@ -12850,11 +5637,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x16x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x16x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x16x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
@@ -12872,11 +5665,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x16x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x16x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x16x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x16x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
@@ -12893,11 +5692,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x32x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x32x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x32x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
@@ -12915,11 +5720,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x32x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x32x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x32x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
@@ -12936,11 +5747,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x32x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x32x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x32x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
@@ -12958,11 +5775,17 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x32x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x32x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x32x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x32x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
@@ -12979,105 +5802,17 @@ struct MMA_Traits>
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = half_t;
-  using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_48,_32>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = half_t;
-  using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = half_t;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_48,_32>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
-
-  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_48,_32>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
-
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
-template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x48x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
-{
-  using ValTypeD = float;
-  using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e4m3_t;
-  using ValTypeC = float;
-
-  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
-
-  using Shape_MNK = Shape<_64,_48,_32>;
-  using ThrID = Layout<_128>;
-  using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 48, 32>;
-  using CLayout = GMMA::CLayout_64x48;
-
-  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
-};
-#endif
-
-////////////////////////////////////////////////////////////////////////////////////////////////////
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x64x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x64x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x64x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
-  using ValTypeA = float_e5m2_t;
+  using ValTypeA = float_e4m3_t;
   using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
@@ -13095,11 +5830,17 @@ struct MMA_Traits>
//////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x64x32_F16E4M3E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = half_t; @@ -13116,11 +5857,17 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x64x32_F32E4M3E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; @@ -13138,126 +5885,44 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = float; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_64,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 64, 32>; - using CLayout = GMMA::CLayout_64x64; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = half_t; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = half_t; - - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - -//////////////////////////////////////////////////////////////////////////////////////////////////// - -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) -template -struct MMA_Traits> -{ - using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; - using ValTypeC = float; - - using FrgTypeA = GMMA::smem_desc; - using FrgTypeB = GMMA::smem_desc; - - using Shape_MNK = Shape<_64,_80,_32>; - using ThrID = Layout<_128>; - using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; - - GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; -}; -#endif - 
-//////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x64x32_F32E4M3E4M3_RS_TN; -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_80,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 80, 32>; - using CLayout = GMMA::CLayout_64x80; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x96x32_F16E4M3E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = half_t; @@ -13275,11 +5940,17 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x96x32_F16E4M3E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = half_t; @@ -13296,11 +5967,17 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x96x32_F32E4M3E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; @@ -13318,11 +5995,17 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x96x32_F32E4M3E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; @@ -13339,1280 +6022,1558 @@ struct MMA_Traits> //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x128x32_F16E4M3E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = 
float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x128x32_F16E4M3E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x128x32_F32E4M3E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x128x32_F32E4M3E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x192x32_F16E4M3E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x192x32_F16E4M3E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x192x32_F32E4M3E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x192x32_F32E4M3E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; 
//////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x256x32_F16E4M3E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x256x32_F16E4M3E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x256x32_F32E4M3E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x256x32_F32E4M3E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_32>; + using 
Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x8x32_F16E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x8x32_F16E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x8x32_F32E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x8x32_F32E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x16x32_F16E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x16x32_F16E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x16x32_F32E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 
64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x16x32_F32E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x32x32_F16E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x32x32_F16E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x32x32_F32E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = 
float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x32x32_F32E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x64x32_F16E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x64x32_F16E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x64x32_F32E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x64x32_F32E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x96x32_F16E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_224,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x96x32_F16E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = 
Shape<_64,_224,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x96x32_F32E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_224,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x96x32_F32E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_224,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x128x32_F16E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + 
GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x128x32_F16E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x128x32_F32E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x128x32_F32E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x192x32_F16E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_256,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = 
GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x192x32_F16E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_256,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x192x32_F32E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_256,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x192x32_F32E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; - using ValTypeB = float_e4m3_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_256,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<256, 32>; - using CLayout = GMMA::CLayout_64x256; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x256x32_F16E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e5m2_t; using ValTypeC = half_t; 
using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_8,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x256x32_F16E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e5m2_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_8,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x256x32_F32E4M3E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_8,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x256x32_F32E4M3E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; - using ValTypeA = float_e5m2_t; + using ValTypeA = float_e4m3_t; using ValTypeB = float_e5m2_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_8,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 8, 32>; - using CLayout = GMMA::CLayout_64x8; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x8x32_F16E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = 
float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_16,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x8x32_F16E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_16,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x8x32_F32E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_16,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x8x32_F32E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_16,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 16, 32>; - using CLayout = GMMA::CLayout_64x16; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x16x32_F16E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using 
ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_32,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x16x32_F16E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_32,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x16x32_F32E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_32,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x16x32_F32E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_32,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 32, 32>; - using CLayout = GMMA::CLayout_64x32; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F16E5M2E4M3_SS_TN = 
SM90::GMMA::MMA_64x32x32_F16E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x32x32_F16E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x32x32_F32E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x32x32_F32E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_48,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 48, 32>; - using CLayout = GMMA::CLayout_64x48; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; 
-#endif
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x64x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x64x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x64x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
   using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
 
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
@@ -14629,12 +7590,18 @@ struct MMA_Traits>
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x64x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x64x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x64x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
   using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
 
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -14650,12 +7617,18 @@ struct MMA_Traits>
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x64x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x64x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x64x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
   using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
 
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
@@ -14672,12 +7645,18 @@ struct MMA_Traits>
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x64x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x64x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x64x32_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x64x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
   using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
 
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
@@ -14693,369 +7672,454 @@ struct MMA_Traits>
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x96x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x96x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x96x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
   using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
 
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
 
-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;
 
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x96x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x96x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x96x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
   using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
 
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
 
-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;
 
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x96x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x96x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x96x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
   using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
 
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
 
-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;
 
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x96x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x96x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x80x32_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x96x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
 {
   using ValTypeD = float;
   using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
   using ValTypeC = float;
 
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
 
-  using Shape_MNK = Shape<_64,_80,_32>;
+  using Shape_MNK = Shape<_64,_96,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout< 80, 32>;
-  using CLayout = GMMA::CLayout_64x80;
+  using BLayout = GMMA::ABLayout< 96, 32>;
+  using CLayout = GMMA::CLayout_64x96;
 
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x128x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x128x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
-struct MMA_Traits<SM90_64x96x32_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+struct MMA_Traits<SM90_64x128x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
 {
   using ValTypeD = half_t;
   using ValTypeA = float_e5m2_t;
-  using ValTypeB = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
   using ValTypeC = half_t;
 
   using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
   using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
 
-  using Shape_MNK = Shape<_64,_96,_32>;
+  using Shape_MNK = Shape<_64,_128,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ABLayout< 64, 32>;
-  using BLayout = GMMA::ABLayout< 96, 32>;
-  using CLayout = GMMA::CLayout_64x96;
+  using BLayout = GMMA::ABLayout<128, 32>;
+  using CLayout = GMMA::CLayout_64x128;
 
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
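Editor's note (not part of the diff): every `*E5M2E4M3*` hunk changes only `ValTypeB` from `float_e5m2_t` to `float_e4m3_t`; the shape lines shift because the diff is aligned against the old `E5M2E5M2` section. For readers unfamiliar with the two FP8 encodings being mixed here, a small standalone illustration using the OCP FP8 maxima (values are standard facts, not diff content):

```cpp
// Hedged illustration of the two FP8 operand types these traits pair.
// e4m3 keeps more mantissa (max finite 448); e5m2 keeps more exponent
// (max finite 57344). The E5M2xE4M3 atoms use one of each.
#include <cutlass/float8.h>
#include <cstdio>

int main() {
  cutlass::float_e5m2_t a(57344.0f);  // widest finite e5m2 value (A operand)
  cutlass::float_e4m3_t b(448.0f);    // widest finite e4m3 value (B operand)
  std::printf("e5m2 max = %g, e4m3 max = %g\n",
              static_cast<float>(a), static_cast<float>(b));
  return 0;
}
```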
//////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x128x32_F16E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_96,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x128x32_F32E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_96,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x128x32_F32E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_96,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout< 96, 32>; - using CLayout = GMMA::CLayout_64x96; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x192x32_F16E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; + 
using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x192x32_F16E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x192x32_F32E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x192x32_F32E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_112,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<112, 32>; - using CLayout = GMMA::CLayout_64x112; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x256x32_F16E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeA = 
GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x256x32_F16E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = half_t; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x256x32_F32E5M2E4M3_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x256x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x256x32_F32E5M2E4M3_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; - using ValTypeB = float_e5m2_t; + using ValTypeB = float_e4m3_t; using ValTypeC = float; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_128,_32>; + using Shape_MNK = Shape<_64,_256,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<128, 32>; - using CLayout = GMMA::CLayout_64x128; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x8x32_F16E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using 
ValTypeA = float_e5m2_t; @@ -15065,21 +8129,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x8x32_F16E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15088,21 +8156,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x8x32_F32E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15112,21 +8184,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x8x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x8x32_F32E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15135,21 +8211,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_144,_32>; + using Shape_MNK = Shape<_64,_8,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<144, 32>; - using CLayout = GMMA::CLayout_64x144; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F16E5M2E5M2_SS_TN = 
SM90::GMMA::MMA_64x16x32_F16E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15159,21 +8239,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x16x32_F16E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15182,21 +8266,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x16x32_F32E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15206,21 +8294,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x16x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x16x32_F32E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15229,21 +8321,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_160,_32>; + using Shape_MNK = Shape<_64,_16,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<160, 32>; - using CLayout = GMMA::CLayout_64x160; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + 
GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x32x32_F16E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15253,21 +8349,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x32x32_F16E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15276,21 +8376,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x32x32_F32E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15300,21 +8404,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x32x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x32x32_F32E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15323,20 +8431,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_176,_32>; + using Shape_MNK = Shape<_64,_32,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<176, 32>; - using CLayout = GMMA::CLayout_64x176; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif 
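Editor's note (not part of the diff): the recurring deleted `-#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)` / `-#endif` pairs are the other half of this reorganization. Standard-N traits lose their guards and become unconditionally visible, while the extended-N set migrates into the new `mma_traits_sm90_gmma_ext.hpp` included at the bottom of this file. A hedged sketch of the consumer-side effect; the macro spelling is from this diff, the type names are illustrative:

```cpp
// Hedged sketch: atom visibility after the refactor.
// Standard N (8, 16, 32, 64, 96, 128, 192, 256) needs no macro:
#include <cute/atom/mma_traits_sm90_gmma.hpp>

using Std = cute::SM90_64x32x32_F32E5M2E5M2_SS_TN<>;  // always available

// Extended N (24, 40, 48, ...) compiles only when the translation unit
// defines CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED before including cute:
#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
using Ext = cute::SM90_64x40x16_F16F16F16_SS<cute::GMMA::Major::K,
                                             cute::GMMA::Major::K>;
#endif
```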
//////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x64x32_F16E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15346,19 +8459,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x64x32_F16E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15367,19 +8486,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x64x32_F32E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15389,19 +8514,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x64x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x64x32_F32E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15410,20 +8541,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_192,_32>; + using Shape_MNK = Shape<_64,_64,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<192, 32>; - using CLayout = GMMA::CLayout_64x192; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; //////////////////////////////////////////////////////////////////////////////////////////////////// -#if 
defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x96x32_F16E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15433,21 +8569,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x96x32_F16E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15456,21 +8596,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x96x32_F32E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15480,21 +8624,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x96x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x96x32_F32E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15503,21 +8651,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_208,_32>; + using Shape_MNK = Shape<_64,_96,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<208, 32>; - using CLayout = GMMA::CLayout_64x208; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; GMMA::ScaleOut accumulate_ = 
GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x128x32_F16E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15527,21 +8679,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_224,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x128x32_F16E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15550,21 +8706,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_224,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x128x32_F32E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15574,21 +8734,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_224,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<224, 32>; - using CLayout = GMMA::CLayout_64x224; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x128x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x128x32_F32E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15597,21 +8761,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_224,_32>; + using Shape_MNK = Shape<_64,_128,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<224, 32>; - using 
CLayout = GMMA::CLayout_64x224; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x192x32_F16E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15621,21 +8789,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x192x32_F16E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = half_t; using ValTypeA = float_e5m2_t; @@ -15644,21 +8816,25 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ALayout_64x32; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x192x32_F32E5M2E5M2_SS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15668,21 +8844,25 @@ struct MMA_Traits> using FrgTypeA = GMMA::smem_desc; using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_32>; + using Shape_MNK = Shape<_64,_192,_32>; using ThrID = Layout<_128>; using ALayout = GMMA::ABLayout< 64, 32>; - using BLayout = GMMA::ABLayout<240, 32>; - using CLayout = GMMA::CLayout_64x240; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; }; -#endif //////////////////////////////////////////////////////////////////////////////////////////////////// -#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED) +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x192x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x192x32_F32E5M2E5M2_RS_TN; + template -struct MMA_Traits> +struct MMA_Traits> { using ValTypeD = float; using ValTypeA = float_e5m2_t; @@ -15691,18 +8871,23 @@ struct MMA_Traits> using FrgTypeB = GMMA::smem_desc; - using Shape_MNK = Shape<_64,_240,_32>; + using Shape_MNK 
= Shape<_64,_192,_32>;
   using ThrID   = Layout<_128>;
   using ALayout = GMMA::ALayout_64x32;
-  using BLayout = GMMA::ABLayout<240, 32>;
-  using CLayout = GMMA::CLayout_64x240;
+  using BLayout = GMMA::ABLayout<192, 32>;
+  using CLayout = GMMA::CLayout_64x192;
 
   GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
 };
-#endif
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x256x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x256x32_F16E5M2E5M2_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
 struct MMA_Traits<SM90_64x256x32_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
 {
@@ -15725,6 +8910,12 @@ struct MMA_Traits>
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x256x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x256x32_F16E5M2E5M2_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
 struct MMA_Traits<SM90_64x256x32_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
 {
@@ -15746,6 +8937,12 @@ struct MMA_Traits>
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x256x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x256x32_F32E5M2E5M2_SS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
 struct MMA_Traits<SM90_64x256x32_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
 {
@@ -15768,6 +8965,12 @@ struct MMA_Traits>
 
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x256x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x256x32_F32E5M2E5M2_RS_TN<scaleA, scaleB>;
+
 template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
 struct MMA_Traits<SM90_64x256x32_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
 {
@@ -15790,3 +8993,7 @@ struct MMA_Traits>
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
 } // end namespace cute
+
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+#include "mma_traits_sm90_gmma_ext.hpp"
+#endif
diff --git a/include/cute/atom/mma_traits_sm90_gmma_ext.hpp b/include/cute/atom/mma_traits_sm90_gmma_ext.hpp
new file mode 100644
index 0000000000..15e2412c87
--- /dev/null
+++ b/include/cute/atom/mma_traits_sm90_gmma_ext.hpp
@@ -0,0 +1,20116 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ *    this list of conditions and the following disclaimer in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ *    contributors may be used to endorse or promote products derived from
+ *    this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+#pragma once
+
+#include
+#include
+
+namespace cute {
+
+namespace SM90::GMMA {
+
+using CLayout_64x24  = CLayout_64xN< 24>;
+using CLayout_64x40  = CLayout_64xN< 40>;
+using CLayout_64x48  = CLayout_64xN< 48>;
+using CLayout_64x56  = CLayout_64xN< 56>;
+using CLayout_64x72  = CLayout_64xN< 72>;
+using CLayout_64x80  = CLayout_64xN< 80>;
+using CLayout_64x88  = CLayout_64xN< 88>;
+using CLayout_64x104 = CLayout_64xN<104>;
+using CLayout_64x112 = CLayout_64xN<112>;
+using CLayout_64x120 = CLayout_64xN<120>;
+using CLayout_64x136 = CLayout_64xN<136>;
+using CLayout_64x144 = CLayout_64xN<144>;
+using CLayout_64x152 = CLayout_64xN<152>;
+using CLayout_64x160 = CLayout_64xN<160>;
+using CLayout_64x168 = CLayout_64xN<168>;
+using CLayout_64x176 = CLayout_64xN<176>;
+using CLayout_64x184 = CLayout_64xN<184>;
+using CLayout_64x200 = CLayout_64xN<200>;
+using CLayout_64x208 = CLayout_64xN<208>;
+using CLayout_64x216 = CLayout_64xN<216>;
+using CLayout_64x224 = CLayout_64xN<224>;
+using CLayout_64x232 = CLayout_64xN<232>;
+using CLayout_64x240 = CLayout_64xN<240>;
+using CLayout_64x248 = CLayout_64xN<248>;
+
+}
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x16_F16F16F16_SS = SM90::GMMA::MMA_64x24x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_24,_16>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout< 24, 16>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x16_F16F16F16_RS = SM90::GMMA::MMA_64x24x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_24,_16>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout< 24, 16>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x16_F16F16F16_SS = SM90::GMMA::MMA_64x40x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 40, 16>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x40x16_F16F16F16_RS = SM90::GMMA::MMA_64x40x16_F16F16F16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 40, 16>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x16_F16F16F16_SS = SM90::GMMA::MMA_64x48x16_F16F16F16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 48, 16>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x16_F16F16F16_RS = SM90::GMMA::MMA_64x48x16_F16F16F16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 48, 16>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x16_F16F16F16_SS = SM90::GMMA::MMA_64x56x16_F16F16F16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 56, 16>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = 
GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x16_F16F16F16_RS = SM90::GMMA::MMA_64x56x16_F16F16F16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 56, 16>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x16_F16F16F16_SS = SM90::GMMA::MMA_64x72x16_F16F16F16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 72, 16>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x16_F16F16F16_RS = SM90::GMMA::MMA_64x72x16_F16F16F16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 72, 16>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x16_F16F16F16_SS = SM90::GMMA::MMA_64x80x16_F16F16F16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 80, 16>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x16_F16F16F16_RS = SM90::GMMA::MMA_64x80x16_F16F16F16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = 
half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 80, 16>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x16_F16F16F16_SS = SM90::GMMA::MMA_64x88x16_F16F16F16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 88, 16>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x16_F16F16F16_RS = SM90::GMMA::MMA_64x88x16_F16F16F16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 88, 16>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x16_F16F16F16_SS = SM90::GMMA::MMA_64x104x16_F16F16F16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<104, 16>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x16_F16F16F16_RS = SM90::GMMA::MMA_64x104x16_F16F16F16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<104, 16>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x16_F16F16F16_SS = SM90::GMMA::MMA_64x112x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_112,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<112, 16>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x16_F16F16F16_RS = SM90::GMMA::MMA_64x112x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_112,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<112, 16>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x16_F16F16F16_SS = SM90::GMMA::MMA_64x120x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_120,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<120, 16>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x16_F16F16F16_RS = SM90::GMMA::MMA_64x120x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_120,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<120, 16>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x16_F16F16F16_SS = SM90::GMMA::MMA_64x136x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_136,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<136, 16>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x16_F16F16F16_RS = SM90::GMMA::MMA_64x136x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_136,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<136, 16>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x16_F16F16F16_SS = SM90::GMMA::MMA_64x144x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_144,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<144, 16>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x16_F16F16F16_RS = SM90::GMMA::MMA_64x144x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_144,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<144, 16>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x16_F16F16F16_SS = SM90::GMMA::MMA_64x152x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_152,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<152, 16>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x16_F16F16F16_RS = SM90::GMMA::MMA_64x152x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_152,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<152, 16>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x160x16_F16F16F16_SS = SM90::GMMA::MMA_64x160x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x160x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_160,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<160, 16>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x160x16_F16F16F16_RS = SM90::GMMA::MMA_64x160x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x160x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_160,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<160, 16>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x16_F16F16F16_SS = SM90::GMMA::MMA_64x168x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_168,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<168, 16>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x16_F16F16F16_RS = SM90::GMMA::MMA_64x168x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_168,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<168, 16>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x16_F16F16F16_SS = SM90::GMMA::MMA_64x176x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_176,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<176, 16>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x16_F16F16F16_RS = SM90::GMMA::MMA_64x176x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_176,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<176, 16>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x16_F16F16F16_SS = SM90::GMMA::MMA_64x184x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_184,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<184, 16>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x16_F16F16F16_RS = SM90::GMMA::MMA_64x184x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_184,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<184, 16>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x16_F16F16F16_SS = SM90::GMMA::MMA_64x200x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_200,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<200, 16>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x16_F16F16F16_RS = SM90::GMMA::MMA_64x200x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_200,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<200, 16>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x16_F16F16F16_SS = SM90::GMMA::MMA_64x208x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_208,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<208, 16>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x16_F16F16F16_RS = SM90::GMMA::MMA_64x208x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_208,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<208, 16>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x16_F16F16F16_SS = SM90::GMMA::MMA_64x216x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_216,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<216, 16>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x16_F16F16F16_RS = SM90::GMMA::MMA_64x216x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_216,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<216, 16>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x16_F16F16F16_SS = SM90::GMMA::MMA_64x224x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_224,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<224, 16>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x16_F16F16F16_RS = SM90::GMMA::MMA_64x224x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_224,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<224, 16>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x16_F16F16F16_SS = SM90::GMMA::MMA_64x232x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_232,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<232, 16>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x16_F16F16F16_RS = SM90::GMMA::MMA_64x232x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_232,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<232, 16>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x16_F16F16F16_SS = SM90::GMMA::MMA_64x240x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_240,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<240, 16>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x16_F16F16F16_RS = SM90::GMMA::MMA_64x240x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_240,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<240, 16>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x16_F16F16F16_SS = SM90::GMMA::MMA_64x248x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x16_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_248,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<248, 16>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x16_F16F16F16_RS = SM90::GMMA::MMA_64x248x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x16_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_248,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<248, 16>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x16_F32F16F16_SS = SM90::GMMA::MMA_64x24x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_24,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout< 24, 16>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x16_F32F16F16_RS = SM90::GMMA::MMA_64x24x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_24,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout< 24, 16>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x16_F32F16F16_SS = SM90::GMMA::MMA_64x40x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_40,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout< 40, 16>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x16_F32F16F16_RS = SM90::GMMA::MMA_64x40x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_40,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout< 40, 16>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
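+// Usage sketch (illustrative only; the concrete shape and majorness here are
+// placeholders): each extended-N atom is consumed exactly like the existing
+// power-of-two shapes, e.g. wrapped in an MMA_Atom and tiled:
+//
+//   using MMA = cute::MMA_Atom<cute::SM90_64x40x16_F32F16F16_SS<
+//       cute::GMMA::Major::K, cute::GMMA::Major::K>>;
+//   auto tiled_mma = cute::make_tiled_mma(MMA{});
+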
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x16_F32F16F16_SS = SM90::GMMA::MMA_64x48x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_48,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout< 48, 16>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x16_F32F16F16_RS = SM90::GMMA::MMA_64x48x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_48,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout< 48, 16>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x16_F32F16F16_SS = SM90::GMMA::MMA_64x56x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_56,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout< 56, 16>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x16_F32F16F16_RS = SM90::GMMA::MMA_64x56x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_56,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout< 56, 16>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x16_F32F16F16_SS = SM90::GMMA::MMA_64x72x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_72,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout< 72, 16>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x16_F32F16F16_RS = SM90::GMMA::MMA_64x72x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_72,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout< 72, 16>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x16_F32F16F16_SS = SM90::GMMA::MMA_64x80x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_80,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout< 80, 16>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x16_F32F16F16_RS = SM90::GMMA::MMA_64x80x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_80,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout< 80, 16>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x16_F32F16F16_SS = SM90::GMMA::MMA_64x88x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_88,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout< 88, 16>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x16_F32F16F16_RS = SM90::GMMA::MMA_64x88x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_88,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout< 88, 16>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x16_F32F16F16_SS = SM90::GMMA::MMA_64x104x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_104,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<104, 16>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x16_F32F16F16_RS = SM90::GMMA::MMA_64x104x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_104,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<104, 16>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x16_F32F16F16_SS = SM90::GMMA::MMA_64x112x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_112,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<112, 16>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x16_F32F16F16_RS = SM90::GMMA::MMA_64x112x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_112,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<112, 16>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x16_F32F16F16_SS = SM90::GMMA::MMA_64x120x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_120,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<120, 16>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x16_F32F16F16_RS = SM90::GMMA::MMA_64x120x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_120,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<120, 16>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x16_F32F16F16_SS = SM90::GMMA::MMA_64x136x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_136,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<136, 16>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x16_F32F16F16_RS = SM90::GMMA::MMA_64x136x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_136,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<136, 16>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x16_F32F16F16_SS = SM90::GMMA::MMA_64x144x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_144,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<144, 16>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x16_F32F16F16_RS = SM90::GMMA::MMA_64x144x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_144,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<144, 16>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x16_F32F16F16_SS = SM90::GMMA::MMA_64x152x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_152,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<152, 16>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x16_F32F16F16_RS = SM90::GMMA::MMA_64x152x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_152,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<152, 16>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x160x16_F32F16F16_SS = SM90::GMMA::MMA_64x160x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x160x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_160,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<160, 16>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x160x16_F32F16F16_RS = SM90::GMMA::MMA_64x160x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x160x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_160,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<160, 16>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x16_F32F16F16_SS = SM90::GMMA::MMA_64x168x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_168,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<168, 16>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x16_F32F16F16_RS = SM90::GMMA::MMA_64x168x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_168,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<168, 16>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x16_F32F16F16_SS = SM90::GMMA::MMA_64x176x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_176,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<176, 16>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x16_F32F16F16_RS = SM90::GMMA::MMA_64x176x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_176,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<176, 16>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x16_F32F16F16_SS = SM90::GMMA::MMA_64x184x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_184,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<184, 16>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x16_F32F16F16_RS = SM90::GMMA::MMA_64x184x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_184,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<184, 16>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x16_F32F16F16_SS = SM90::GMMA::MMA_64x200x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_200,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<200, 16>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x16_F32F16F16_RS = SM90::GMMA::MMA_64x200x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_200,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<200, 16>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
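+// Note: the F32F16F16 variants mirror the F16F16F16 traits earlier in this
+// file; only the accumulator type changes (ValTypeD and ValTypeC are float,
+// while A and B remain half_t).
+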
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x16_F32F16F16_SS = SM90::GMMA::MMA_64x208x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_208,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<208, 16>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x16_F32F16F16_RS = SM90::GMMA::MMA_64x208x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_208,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<208, 16>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x16_F32F16F16_SS = SM90::GMMA::MMA_64x216x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_216,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<216, 16>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x16_F32F16F16_RS = SM90::GMMA::MMA_64x216x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_216,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<216, 16>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x16_F32F16F16_SS = SM90::GMMA::MMA_64x224x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_224,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<224, 16>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x16_F32F16F16_RS = SM90::GMMA::MMA_64x224x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_224,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<224, 16>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x16_F32F16F16_SS = SM90::GMMA::MMA_64x232x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_232,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<232, 16>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x16_F32F16F16_RS = SM90::GMMA::MMA_64x232x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x16_F32F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_232,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x16;
+  using BLayout = GMMA::ABLayout<232, 16>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::Major tnspA,
+  GMMA::Major tnspB,
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x16_F32F16F16_SS = SM90::GMMA::MMA_64x240x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>;
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x16_F32F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = half_t;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_240,_16>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 16>;
+  using BLayout = GMMA::ABLayout<240, 16>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x16_F32F16F16_RS = SM90::GMMA::MMA_64x240x16_F32F16F16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<240, 16>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x248x16_F32F16F16_SS = SM90::GMMA::MMA_64x248x16_F32F16F16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<248, 16>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x248x16_F32F16F16_RS = SM90::GMMA::MMA_64x248x16_F32F16F16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = half_t; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<248, 16>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x24x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x24x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 24, 16>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x24x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x24x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + 
using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 24, 16>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x40x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x40x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 40, 16>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x40x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x40x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 40, 16>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x48x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 48, 16>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x48x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 48, 16>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x56x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 56, 16>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x56x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 56, 16>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x72x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 72, 16>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x72x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 72, 16>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x80x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using 
ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 80, 16>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x80x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 80, 16>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x88x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout< 88, 16>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x88x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout< 88, 16>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x104x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<104, 16>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = 
GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x104x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<104, 16>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x112x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<112, 16>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x112x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<112, 16>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x120x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<120, 16>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x120x16_F32BF16BF16_RS; + +template +struct 
MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<120, 16>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x136x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<136, 16>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x136x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<136, 16>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x144x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<144, 16>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x144x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<144, 16>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut 
accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x152x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<152, 16>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x152x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<152, 16>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x160x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<160, 16>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x160x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<160, 16>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x168x16_F32BF16BF16_SS; + 
+template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<168, 16>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x168x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<168, 16>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x176x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<176, 16>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x176x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<176, 16>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x184x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<184, 16>; + 
using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x184x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<184, 16>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x200x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<200, 16>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x200x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<200, 16>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x208x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x208x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<208, 16>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using 
SM90_64x208x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x208x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<208, 16>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x216x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<216, 16>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x216x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<216, 16>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x224x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<224, 16>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x224x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using 
BLayout = GMMA::ABLayout<224, 16>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x232x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x232x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<232, 16>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x232x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x232x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<232, 16>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x240x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<240, 16>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x240x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<240, 16>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One 
+> +using SM90_64x248x16_F32BF16BF16_SS = SM90::GMMA::MMA_64x248x16_F32BF16BF16_SS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using BLayout = GMMA::ABLayout<248, 16>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::Major tnspA, + GMMA::Major tnspB, + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x248x16_F32BF16BF16_RS = SM90::GMMA::MMA_64x248x16_F32BF16BF16_RS; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = bfloat16_t; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using BLayout = GMMA::ABLayout<248, 16>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x24x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x24x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout< 24, 8>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x24x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x24x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout< 24, 8>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x40x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x40x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout< 40, 8>; + using CLayout = GMMA::CLayout_64x40; + + 
GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x40x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x40x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout< 40, 8>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x48x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout< 48, 8>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x48x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout< 48, 8>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x56x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout< 56, 8>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x56x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = 
Shape<_64,_56,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout< 56, 8>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x72x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout< 72, 8>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x72x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout< 72, 8>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x80x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout< 80, 8>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x80x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout< 80, 8>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x88x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using 
ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout< 88, 8>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x88x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout< 88, 8>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x104x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout<104, 8>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x104x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout<104, 8>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x112x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout<112, 8>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < 
+ GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x112x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout<112, 8>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x120x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout<120, 8>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x120x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout<120, 8>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x136x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout<136, 8>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x136x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout<136, 8>; + using CLayout = 
GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x144x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout<144, 8>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x144x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout<144, 8>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x152x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 8>; + using BLayout = GMMA::ABLayout<152, 8>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x152x8_F32TF32TF32_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_8>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x8; + using BLayout = GMMA::ABLayout<152, 8>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x160x8_F32TF32TF32_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = tfloat32_t; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = 
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<160, 8>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x160x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x160x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x160x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<160, 8>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x168x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<168, 8>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x168x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<168, 8>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x176x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<176, 8>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x176x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<176, 8>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x184x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<184, 8>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x184x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<184, 8>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x200x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<200, 8>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x200x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<200, 8>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
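For orientation, and not part of the diff itself: each of these `MMA_Traits` specializations is what lets CuTe lift the corresponding warpgroup GMMA instruction into a `TiledMMA`. A minimal sketch, assuming the extended atoms are exposed in namespace `cute` just like the existing SM90 GMMA atoms:

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

// Sketch: wrap one of the extended-shape TF32 atoms added above into a TiledMMA.
// The scale inputs default to GMMA::ScaleIn::One, matching the alias defaults.
using MmaOp    = cute::SM90_64x120x8_F32TF32TF32_SS_TN<>;
using TiledMma = decltype(cute::make_tiled_mma(MmaOp{}));

// The traits expose the atom's M/N/K tile shape at compile time.
using Traits = cute::MMA_Traits<MmaOp>;
static_assert(cute::size<0>(typename Traits::Shape_MNK{}) == 64,  "atom M");
static_assert(cute::size<1>(typename Traits::Shape_MNK{}) == 120, "atom N");
```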
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x208x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<208, 8>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x208x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<208, 8>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x216x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<216, 8>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x216x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<216, 8>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x224x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<224, 8>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x224x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<224, 8>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x232x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<232, 8>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x232x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<232, 8>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x240x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<240, 8>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x240x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<240, 8>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x8_F32TF32TF32_SS_TN = SM90::GMMA::MMA_64x248x8_F32TF32TF32_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x8_F32TF32TF32_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 8>;
+  using BLayout = GMMA::ABLayout<248, 8>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x8_F32TF32TF32_RS_TN = SM90::GMMA::MMA_64x248x8_F32TF32TF32_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x8_F32TF32TF32_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = tfloat32_t;
+  using ValTypeB = tfloat32_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_8>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x8;
+  using BLayout = GMMA::ABLayout<248, 8>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x24x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x24x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x48x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x48x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x80x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x80x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x112x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x112x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x144x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x144x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x160x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x160x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x176x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
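A brief note on the suffixes, which the trait bodies above make concrete: `SS` atoms read both A and B through shared-memory descriptors (both `FrgTypeA` and `FrgTypeB` are `GMMA::smem_desc`), `RS` atoms source A from registers (no `FrgTypeA`, and `ALayout` becomes a register layout such as `GMMA::ALayout_64x32`), and the `_SATURATE` variants select the PTX `satfinite` qualifier on the integer instruction. A small illustrative check; `has_smem_frg_a` is a hypothetical helper written only for this example, and the include paths assume the traits are reachable through the usual CuTe headers:

```cpp
#include <type_traits>
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

// Hypothetical helper: does an MMA traits class declare an smem fragment for A?
template <class Traits, class = void>
struct has_smem_frg_a : std::false_type {};
template <class Traits>
struct has_smem_frg_a<Traits, std::void_t<typename Traits::FrgTypeA>>
    : std::true_type {};

using SsTraits = cute::MMA_Traits<cute::SM90_64x176x32_S32S8S8_SS_TN>;
using RsTraits = cute::MMA_Traits<cute::SM90_64x176x32_S32S8S8_RS_TN>;
static_assert(has_smem_frg_a<SsTraits>::value,  "SS: A is read via an smem descriptor");
static_assert(!has_smem_frg_a<RsTraits>::value, "RS: A is sourced from registers");
```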
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x176x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x208x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x208x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x224x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x224x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32S8S8_SS_TN = SM90::GMMA::MMA_64x240x32_S32S8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32S8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32S8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x240x32_S32S8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32S8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x24x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x24x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x48x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x48x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x80x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x80x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x112x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x112x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x144x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x144x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x160x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x160x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x176x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x176x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x208x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x208x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x224x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x224x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32S8S8_RS_TN = SM90::GMMA::MMA_64x240x32_S32S8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32S8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x240x32_S32S8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x24x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x24x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x48x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x48x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x80x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x80x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x112x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x112x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x144x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x144x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x160x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x160x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x176x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x176x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x208x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x208x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x224x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x224x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32S8U8_SS_TN = SM90::GMMA::MMA_64x240x32_S32S8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32S8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x240x32_S32S8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x24x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x24x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x48x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x48x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x80x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x80x32_S32S8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x112x32_S32S8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = int8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
SM90_64x112x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x112x32_S32S8U8_RS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x144x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x144x32_S32S8U8_RS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x144x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x144x32_S32S8U8_RS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x160x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x160x32_S32S8U8_RS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x160x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x160x32_S32S8U8_RS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x176x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x176x32_S32S8U8_RS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; 
+ using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x176x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x176x32_S32S8U8_RS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x208x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x208x32_S32S8U8_RS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x208x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x208x32_S32S8U8_RS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x224x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x224x32_S32S8U8_RS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x224x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x224x32_S32S8U8_RS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x240x32_S32S8U8_RS_TN = SM90::GMMA::MMA_64x240x32_S32S8U8_RS_TN; + +template <> +struct MMA_Traits +{ + 
using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x240x32_S32S8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x240x32_S32S8U8_RS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = int8_t; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x24x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x24x32_S32U8S8_SS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 24, 32>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x24x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x24x32_S32U8S8_SS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 24, 32>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x48x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x48x32_S32U8S8_SS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x48x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x48x32_S32U8S8_SS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_32>; + using ThrID = Layout<_128>; + using 
ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x80x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x80x32_S32U8S8_SS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x80x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x80x32_S32U8S8_SS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x112x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x112x32_S32U8S8_SS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x112x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x112x32_S32U8S8_SS_TN_SATURATE; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + + +using SM90_64x144x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x144x32_S32U8S8_SS_TN; + +template <> +struct MMA_Traits +{ + using ValTypeD = int32_t; + using ValTypeA = uint8_t; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
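Every specialization above follows one pattern: the `ValType*` aliases pin the instruction's operand types, `Shape_MNK` is the per-instruction tile, `ThrID = Layout<_128>` names the issuing warpgroup, and `FrgTypeA`/`FrgTypeB` being `GMMA::smem_desc` is what lets an `_SS_` atom read both operands directly from shared memory. As a minimal sketch of how such a traits-backed atom is consumed downstream (the function and tensor names here are illustrative, not part of this header):

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

using namespace cute;

// Sketch: sA is a (64,32) int8_t smem tensor, sB a (224,32) uint8_t smem
// tensor, both in GMMA-canonical K-major layouts; tCrC holds s32 accumulators.
template <class STensorA, class STensorB, class RTensorC>
__device__ void mma_one_tile(STensorA const& sA, STensorB const& sB, RTensorC& tCrC)
{
  // Wrap the atom in a TiledMMA; all 128 threads of the warpgroup
  // cooperatively issue the 64x224x32 instruction described by Shape_MNK.
  TiledMMA tiled_mma = make_tiled_mma(SM90_64x224x32_S32S8U8_SS_TN{});
  ThrMMA   thr_mma   = tiled_mma.get_thread_slice(threadIdx.x);

  // Because FrgTypeA/FrgTypeB are smem descriptors, the "fragments" built
  // from the smem partitions are GMMA descriptors, not register values.
  Tensor tCsA = thr_mma.partition_A(sA);
  Tensor tCsB = thr_mma.partition_B(sB);
  Tensor tCrA = thr_mma.make_fragment_A(tCsA);
  Tensor tCrB = thr_mma.make_fragment_B(tCsB);

  warpgroup_arrive();                 // fence smem before the async MMA
  gemm(tiled_mma, tCrA, tCrB, tCrC);  // accumulate_ controls C scaling
  warpgroup_commit_batch();
  warpgroup_wait<0>();                // results now visible in tCrC
}
```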
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x144x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x160x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x160x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x176x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x176x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x208x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x208x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x224x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x224x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32U8S8_SS_TN = SM90::GMMA::MMA_64x240x32_S32U8S8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32U8S8_SS_TN_SATURATE = SM90::GMMA::MMA_64x240x32_S32U8S8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x24x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x24x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x48x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x48x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x80x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
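Two axes distinguish the variants in this family. `_SS_` atoms read both A and B through shared-memory descriptors, while the `_RS_` atoms hold A in registers (`ALayout = GMMA::ALayout_64x32` describes that register fragment across the warpgroup) and keep only B behind a descriptor. Independently, each atom has a `_SATURATE` twin that clamps the `int32` accumulator to its representable range instead of wrapping on overflow (PTX `satfinite`). A scalar model of one multiply-accumulate step, for intuition only, not how the hardware is invoked:

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>

// Scalar model of one s8 x u8 -> s32 accumulate step under _SATURATE
// semantics: the result is clamped to [INT32_MIN, INT32_MAX] rather than
// wrapping in two's complement as the non-saturating atoms do.
int32_t imma_satfinite_step(int32_t acc, int8_t a, uint8_t b)
{
  int64_t wide = static_cast<int64_t>(acc) +
                 static_cast<int64_t>(a) * static_cast<int64_t>(b);
  wide = std::clamp<int64_t>(wide,
                             std::numeric_limits<int32_t>::min(),
                             std::numeric_limits<int32_t>::max());
  return static_cast<int32_t>(wide);
}
```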
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x80x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x112x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x112x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x144x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x144x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x160x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x160x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x176x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x176x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x208x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x208x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x224x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x224x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32U8S8_RS_TN = SM90::GMMA::MMA_64x240x32_S32U8S8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32U8S8_RS_TN_SATURATE = SM90::GMMA::MMA_64x240x32_S32U8S8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x24x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x24x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x48x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x48x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x80x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x80x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x112x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x112x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x144x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x144x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x160x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x160x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x176x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x176x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x208x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x208x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x224x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x224x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32U8U8_SS_TN = SM90::GMMA::MMA_64x240x32_S32U8U8_SS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32U8U8_SS_TN_SATURATE = SM90::GMMA::MMA_64x240x32_S32U8U8_SS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x24x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x24x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x24x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x24x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x48x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x48x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x48x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x48x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x80x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x80x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x80x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x80x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x112x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x112x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x112x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x112x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x144x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x144x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x144x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x144x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x160x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x160x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x160x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x160x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x176x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x176x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x176x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x176x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
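With every supported N from 24 to 240 enumerated for each signedness pairing, kernels rarely spell out an atom by name; the GMMA op selectors used by the SM90 collective builders pick the widest instruction that divides a CTA tile. A sketch, assuming the `cute::GMMA::ss_op_selector` helper interface (treat the exact spelling as an assumption):

```cpp
#include <cute/atom/mma_atom.hpp>

using namespace cute;

// Sketch: for an s8 x u8 -> s32 CTA tile of 128x240x64, the selector should
// resolve to the 64x240x32 S32S8U8 SS atom enumerated in this file; swapping
// in rs_op_selector would pick the corresponding RS (A-in-registers) atom.
using AtomOp   = decltype(GMMA::ss_op_selector<
    int8_t, uint8_t, int32_t, Shape<_128, _240, _64>>());
using TiledMma = decltype(make_tiled_mma(AtomOp{}));
```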
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x208x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x208x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x208x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x208x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x224x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x224x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x224x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x224x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32U8U8_RS_TN = SM90::GMMA::MMA_64x240x32_S32U8U8_RS_TN;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+
+using SM90_64x240x32_S32U8U8_RS_TN_SATURATE = SM90::GMMA::MMA_64x240x32_S32U8U8_RS_TN_SATURATE;
+
+template <>
+struct MMA_Traits<SM90_64x240x32_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = uint8_t;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x24x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x24x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x24x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x24x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x40x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x40x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x40x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x40x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x48x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x48x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x48x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x48x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x56x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x56x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x56x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x56x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x72x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x72x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x72x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x72x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x80x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x80x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x80x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x80x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x88x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x88x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x88x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x88x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x104x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x104x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x104x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x104x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x112x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x112x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x112x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x112x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x120x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x120x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x120x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x120x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x136x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x136x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x136x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x136x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x144x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x144x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x144x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x144x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x152x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x152x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x152x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x152x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x160x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x160x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x160x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x160x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x160x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x160x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x160x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x160x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x160x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x160x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x160x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x160x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x168x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x168x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x168x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x168x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x176x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x176x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x176x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x176x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x184x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x184x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x184x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x184x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x200x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x200x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x200x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x200x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x208x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x208x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x208x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x208x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x216x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x216x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x216x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x216x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x224x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x224x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x224x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x224x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x232x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x232x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x232x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x232x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
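+// Usage sketch (illustrative; assumes an SM90 target and the usual CuTe
+// headers). Any extended-N atom in this file can be lifted into a TiledMMA;
+// the 64x232x32 shape below is just one example:
+//
+//   using MmaOp = cute::SM90_64x232x32_F32E4M3E4M3_SS_TN<>;  // ScaleIn::One defaults
+//   auto tiled_mma = cute::make_tiled_mma(MmaOp{});
+//   // A single GMMA atom spans one warpgroup: size(tiled_mma) == 128 threads.
+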
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x240x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x240x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x240x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x240x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
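+// Knobs shared by all of these traits: accumulate_ selects GMMA's ScaleOut
+// behavior at runtime (ScaleOut::One computes D = A*B + C, ScaleOut::Zero
+// drops the accumulator, e.g. for the first MMA of a K loop), while the
+// scaleA/scaleB ScaleIn template parameters optionally negate the A and B
+// operands.
+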
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F16E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x248x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F16E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F16E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x248x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F16E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F32E4M3E4M3_SS_TN = SM90::GMMA::MMA_64x248x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F32E4M3E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F32E4M3E4M3_RS_TN = SM90::GMMA::MMA_64x248x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F32E4M3E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x24x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x24x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x24x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x24x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x40x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x40x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x40x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x40x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x48x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x48x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x48x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout =
GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x48x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x56x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x56x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x56x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x56x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using 
ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x72x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x72x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x72x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x72x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn 
scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x80x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x80x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x80x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x80x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x88x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = 
GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x88x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x88x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x88x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x104x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x104x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using 
ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x104x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x104x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x112x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x112x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One +> +using SM90_64x112x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x112x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x112x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x120x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<120, 32>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x120x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<120, 32>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x120x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<120, 32>; + using 
CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x120x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<120, 32>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x136x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x136x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x136x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x136x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = 
float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x144x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x144x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x144x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x144x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x152x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x152x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x152x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x152x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x160x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = 
GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x160x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x160x32_F32E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x160x32_F32E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x168x32_F16E4M3E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e4m3_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x168x32_F16E4M3E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = 
float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x168x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x168x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x168x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x168x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x176x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x176x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x176x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x176x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x176x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x176x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x184x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x184x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x184x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x184x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x184x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x184x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x200x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x200x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x200x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x200x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x200x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x200x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x208x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x208x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x208x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x208x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x208x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x208x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x216x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x216x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x216x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x216x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x216x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x216x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x224x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x224x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x224x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x224x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x224x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x224x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x232x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x232x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x232x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x232x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x232x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x232x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x240x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x240x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
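> **Editor's note.** Every FP8 trait specialization in this hunk follows one pattern: an alias forwarding the two `GMMA::ScaleIn` parameters to the `SM90::GMMA::MMA_*` op, plus an `MMA_Traits` specialization naming the value types, the warpgroup-wide `Shape_MNK`, the 128-thread `ThrID`, and the operand layouts (`GMMA::ABLayout` for smem-descriptor operands, `GMMA::ALayout_64x32` when A is fed from registers). A minimal usage sketch, assuming a CUTLASS checkout with these cute headers and an SM90 target; the particular atom is an arbitrary pick among those added above:
>
> ```cpp
> #include <cute/tensor.hpp>
> #include <cute/atom/mma_atom.hpp>
>
> using namespace cute;
>
> // Wrap one of the FP8 GMMA ops from this hunk in a TiledMMA.
> // SS_TN: A (e4m3) and B (e5m2) are both read from shared memory via
> // GMMA descriptors; accumulation is in f32. The two template
> // arguments default to GMMA::ScaleIn::One.
> using TiledMma = decltype(make_tiled_mma(SM90_64x232x32_F32E4M3E5M2_SS_TN<>{}));
>
> // The traits pin down the atom: one warpgroup (128 threads)
> // computes a 64x232x32 MMA per instruction.
> CUTE_STATIC_ASSERT_V(size(TiledMma{}) == Int<128>{});
> ```
>
> In a mainloop this `TiledMma` is partitioned with `get_thread_slice` and fed smem tensors for A and B; the `RS_TN` variants below differ only in sourcing A from registers, hence no `FrgTypeA` descriptor and the `GMMA::ALayout_64x32` fragment layout.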
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x240x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x240x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x240x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x240x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F16E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x248x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F16E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x248x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F32E4M3E5M2_SS_TN = SM90::GMMA::MMA_64x248x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F32E4M3E5M2_RS_TN = SM90::GMMA::MMA_64x248x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e4m3_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x24x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x24x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x24x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x24x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x24x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x24x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x40x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x40x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x40x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x40x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x40x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x40x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x48x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x48x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x48x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x48x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x48x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x48x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x56x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x56x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x56x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x56x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x56x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x56x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x72x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x72x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x72x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x72x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x72x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x72x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x80x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x80x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x80x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x80x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x80x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x80x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x88x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x88x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x88x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x88x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x88x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x88x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x104x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x104x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x104x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x104x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x104x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x104x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x112x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x112x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x112x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x112x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x112x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x112x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x120x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x120x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x120x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x120x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x120x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x120x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x136x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x136x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x136x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x136x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x136x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x136x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x144x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x144x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x144x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x32_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x144x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x144x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x144x32_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x152x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x32_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x152x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x152x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x152x32_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x152x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x152x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x160x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x160x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x160x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = 
GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x160x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x168x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x168x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x168x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x168x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using 
ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x176x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x176x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x176x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x176x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + 
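// Editorial sketch (not part of the patch): how the atoms in this hunk are
// consumed. Each alias above is a CuTe MMA atom whose MMA_Traits
// specialization tells cute::make_tiled_mma the tile shape, thread count,
// and operand layouts; everything below besides the chosen atom is standard
// CuTe API.
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

using namespace cute;

// The empty <> takes the default GMMA::ScaleIn::One for scaleA and scaleB.
using Atom     = SM90_64x176x32_F32E5M2E4M3_SS_TN<>;
using TiledMma = decltype(make_tiled_mma(Atom{}));

// One 128-thread warpgroup computes a full 64x176x32 tile per instruction.
static_assert(size(MMA_Traits<Atom>::ThrID{}) == 128);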
+template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x184x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x184x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x184x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x184x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x200x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using 
ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x200x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x200x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x200x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x208x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x208x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x208x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x208x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> 
+{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x208x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x208x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x208x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x208x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x216x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x216x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
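// Editorial sketch (not part of the patch): the SS/RS naming convention used
// throughout these traits. _SS_ atoms read both A and B from shared memory,
// so they define FrgTypeA and FrgTypeB as GMMA::smem_desc<GMMA::Major::K>;
// _RS_ atoms read A from registers, so they omit FrgTypeA and use the
// register fragment layout GMMA::ALayout_64x32 instead. All FP8 atoms here
// are _TN because Hopper's FP8 wgmma only accepts K-major A and B operands.
#include <type_traits>
#include <cute/atom/mma_traits_sm90_gmma.hpp>

using SS = cute::MMA_Traits<cute::SM90_64x216x32_F16E5M2E4M3_SS_TN<>>;
using RS = cute::MMA_Traits<cute::SM90_64x216x32_F16E5M2E4M3_RS_TN<>>;

// B is sourced identically in both variants; only the A side differs.
static_assert(std::is_same_v<SS::FrgTypeB, RS::FrgTypeB>);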
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x216x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x216x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x224x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x224x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x224x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = 
GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x224x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x232x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x232x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x232x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x232x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x232x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x232x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using 
SM90_64x232x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x232x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x240x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x240x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x240x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x240x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = 
GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x248x32_F16E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x248x32_F16E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<248, 32>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x248x32_F16E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x248x32_F16E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<248, 32>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x248x32_F32E5M2E4M3_SS_TN = SM90::GMMA::MMA_64x248x32_F32E5M2E4M3_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<248, 32>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x248x32_F32E5M2E4M3_RS_TN = SM90::GMMA::MMA_64x248x32_F32E5M2E4M3_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<248, 32>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x24x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x24x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = 
GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 24, 32>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x24x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x24x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 24, 32>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x24x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x24x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 24, 32>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x24x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x24x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 24, 32>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x40x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x40x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 40, 32>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> 
+using SM90_64x40x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x40x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 40, 32>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x40x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x40x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 40, 32>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x40x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x40x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 40, 32>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x48x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x48x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = 
GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x48x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x48x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x48x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x56x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x56x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x56x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using 
FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x56x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x56x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x72x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x72x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x72x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x72x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using 
SM90_64x72x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x72x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x80x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x80x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x80x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x80x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x80x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
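// Editorial sketch (not part of the patch): why this hunk enumerates so many
// tile widths. Hopper wgmma accepts any N from 8 to 256 in multiples of 8,
// and one 64xNx32 atom performs 2*64*N*32 flops per instruction, so exposing
// the whole range lets a kernel pick a tile-N that divides its problem
// exactly instead of rounding up to a power of two.
constexpr long long wgmma_flops(int n) { return 2LL * 64 * n * 32; }
static_assert(wgmma_flops(80)  == 327'680);    // the 64x80x32 atoms just above
static_assert(wgmma_flops(256) == 1'048'576);  // the largest supported width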
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x88x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x88x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x88x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x88x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x88x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x104x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = 
GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x104x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x104x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x104x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x104x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x112x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using 
SM90_64x112x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x112x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x112x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x112x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x112x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x120x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<120, 32>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x120x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<120, 32>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = 
GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x120x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<120, 32>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x120x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x120x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<120, 32>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x136x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x136x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x136x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = 
GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x136x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x136x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x144x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x144x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x144x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x144x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = 
GMMA::ScaleIn::One +> +using SM90_64x144x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x144x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x152x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x152x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x152x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x152x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x152x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + 
GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x160x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x160x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x160x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x160x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x160x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x168x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; 
+ + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x168x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x168x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x168x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x168x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x176x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + 
GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x176x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x176x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x176x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x176x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x184x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x184x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = 
GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x184x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x184x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x184x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x200x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x200x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x200x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + 
using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x200x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x200x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x208x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x208x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x208x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x208x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x208x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x208x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = 
GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x208x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x208x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x216x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x216x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x216x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x216x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x216x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<216, 32>; + 
using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x224x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x224x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x224x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x224x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x224x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x232x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x232x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = 
float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x232x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x232x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x232x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x232x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x232x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x232x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x240x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn 
scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x240x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x240x32_F32E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x240x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x240x32_F32E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x248x32_F16E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x248x32_F16E5M2E5M2_SS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using BLayout = GMMA::ABLayout<248, 32>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template < + GMMA::ScaleIn scaleA = GMMA::ScaleIn::One, + GMMA::ScaleIn scaleB = GMMA::ScaleIn::One +> +using SM90_64x248x32_F16E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x248x32_F16E5M2E5M2_RS_TN; + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = float_e5m2_t; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using BLayout = GMMA::ABLayout<248, 
32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F32E5M2E5M2_SS_TN = SM90::GMMA::MMA_64x248x32_F32E5M2E5M2_SS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <
+  GMMA::ScaleIn scaleA = GMMA::ScaleIn::One,
+  GMMA::ScaleIn scaleB = GMMA::ScaleIn::One
+>
+using SM90_64x248x32_F32E5M2E5M2_RS_TN = SM90::GMMA::MMA_64x248x32_F32E5M2E5M2_RS_TN<scaleA, scaleB>;
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90_64x248x32_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = float_e5m2_t;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+} // end namespace cute
diff --git a/include/cute/atom/mma_traits_sm90_gmma_sparse.hpp b/include/cute/atom/mma_traits_sm90_gmma_sparse.hpp
new file mode 100644
index 0000000000..27c41ad338
--- /dev/null
+++ b/include/cute/atom/mma_traits_sm90_gmma_sparse.hpp
@@ -0,0 +1,7738 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ *    list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ *    this list of conditions and the following disclaimer in the documentation
+ *    and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ *    contributors may be used to endorse or promote products derived from
+ *    this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED.
+ * IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+#pragma once
+
+#include <cute/pointer_sparse.hpp>            // cute::smem_sparse_ptr_flag
+#include <cute/swizzle.hpp>                   // cute::Swizzle
+#include <cute/tensor_impl.hpp>               // cute::Tensor
+#include <cute/arch/mma_sm90_desc.hpp>        // cute::LayoutType
+#include <cute/arch/mma_sm90_gmma_sparse.hpp> // cute::SM90::GMMA::SPARSE::GMMA_64x8x32_F16F16F16_SS, etc
+#include <cute/atom/mma_traits_sm90_gmma.hpp> // cute::GMMA::Layout_*
+#include <cute/atom/mma_traits.hpp>           // cute::MMA_Traits
+#include <cute/layout_composed.hpp>           // cute::ComposedLayout
+#include <cute/util/type_traits.hpp>          // cute::is_static
+
+namespace cute {
+
+namespace SM90::GMMA {
+
+///////////////////////////////////////////
+// Common layouts for GMMA Shared Memory //
+///////////////////////////////////////////
+
+// M|N-major layouts in units of Type and sparsity factor S
+template <class Type, int S>
+using Layout_MN_INTER_SpAtom = ComposedLayout<Swizzle<0,4,3>, smem_sparse_ptr_flag_bits<S,sizeof_bits_v<Type>>,
+                               decltype(blocked_product(Layout<Shape<_1,Int<S>>>{}, Layout_MN_INTER_Atom<Type>{}.layout_b()))>;
+template <class Type, int S>
+using Layout_MN_SW32_SpAtom = ComposedLayout<Swizzle<1,4,3>, smem_sparse_ptr_flag_bits<S,sizeof_bits_v<Type>>,
+                              decltype(blocked_product(Layout<Shape<_1,Int<S>>>{}, Layout_MN_SW32_Atom<Type>{}.layout_b()))>;
+template <class Type, int S>
+using Layout_MN_SW64_SpAtom = ComposedLayout<Swizzle<2,4,3>, smem_sparse_ptr_flag_bits<S,sizeof_bits_v<Type>>,
+                              decltype(blocked_product(Layout<Shape<_1,Int<S>>>{}, Layout_MN_SW64_Atom<Type>{}.layout_b()))>;
+template <class Type, int S>
+using Layout_MN_SW128_SpAtom = ComposedLayout<Swizzle<3,4,3>, smem_sparse_ptr_flag_bits<S,sizeof_bits_v<Type>>,
+                               decltype(blocked_product(Layout<Shape<_1,Int<S>>>{}, Layout_MN_SW128_Atom<Type>{}.layout_b()))>;
+
+// K-major layouts in units of Type and sparsity factor S
+template <class Type, int S>
+using Layout_K_INTER_SpAtom = ComposedLayout<Swizzle<0,4,3>, smem_sparse_ptr_flag_bits<S,sizeof_bits_v<Type>>,
+                              decltype(blocked_product(Layout<Shape<_1,Int<S>>>{}, Layout_K_INTER_Atom<Type>{}.layout_b()))>;
+template <class Type, int S>
+using Layout_K_SW32_SpAtom = ComposedLayout<Swizzle<1,4,3>, smem_sparse_ptr_flag_bits<S,sizeof_bits_v<Type>>,
+                             decltype(blocked_product(Layout<Shape<_1,Int<S>>>{}, Layout_K_SW32_Atom<Type>{}.layout_b()))>;
+template <class Type, int S>
+using Layout_K_SW64_SpAtom = ComposedLayout<Swizzle<2,4,3>, smem_sparse_ptr_flag_bits<S,sizeof_bits_v<Type>>,
+                             decltype(blocked_product(Layout<Shape<_1,Int<S>>>{}, Layout_K_SW64_Atom<Type>{}.layout_b()))>;
+template <class Type, int S>
+using Layout_K_SW128_SpAtom = ComposedLayout<Swizzle<3,4,3>, smem_sparse_ptr_flag_bits<S,sizeof_bits_v<Type>>,
+                              decltype(blocked_product(Layout<Shape<_1,Int<S>>>{}, Layout_K_SW128_Atom<Type>{}.layout_b()))>;
+
+// With GMMA::Major param
+template <Major major, class Type, int S>
+using Layout_INTER_SpAtom = typename conditional<major == Major::MN,
+                                                 Layout_MN_INTER_SpAtom<Type,S>,
+                                                 Layout_K_INTER_SpAtom<Type,S>>::type;
+template <Major major, class Type, int S>
+using Layout_SW32_SpAtom = typename conditional<major == Major::MN,
+                                                Layout_MN_SW32_SpAtom<Type,S>,
+                                                Layout_K_SW32_SpAtom<Type,S>>::type;
+template <Major major, class Type, int S>
+using Layout_SW64_SpAtom = typename conditional<major == Major::MN,
+                                                Layout_MN_SW64_SpAtom<Type,S>,
+                                                Layout_K_SW64_SpAtom<Type,S>>::type;
+template <Major major, class Type, int S>
+using Layout_SW128_SpAtom = typename conditional<major == Major::MN,
+                                                 Layout_MN_SW128_SpAtom<Type,S>,
+                                                 Layout_K_SW128_SpAtom<Type,S>>::type;
+
+///////////////////////////////////////////////////////////////////////////////
+// Higher level GMMA Descriptor utilities
+///////////////////////////////////////////////////////////////////////////////
+
+template <Major>
+struct sparse_smem_desc : DescriptorIterator {};
+
+} // end namespace SM90::GMMA
+
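The `*_SpAtom` aliases above mirror the dense `Layout_*_Atom` swizzle atoms, with the tile blocked up by the sparsity factor `S` and the pointer flag carrying that factor. A hedged sketch of building a full shared-memory layout from one of them, the same way dense GMMA smem layouts are tiled; the `<half_t, 2>` parameters (element type, sparsity factor for 2:4 f16) are an assumption for illustration.

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/mma_traits_sm90_gmma_sparse.hpp>

using namespace cute;

// K-major, 128B-swizzled sparse atom tiled over a 64x64 operand tile.
using SpAtom     = SM90::GMMA::Layout_K_SW128_SpAtom<half_t, 2>;
using SmemLayout = decltype(tile_to_shape(SpAtom{}, Shape<_64, _64>{}));
```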
+// Customization point for creating a cute::GMMA::sparse_smem_desc Tensor
+template <GMMA::Major MajorMode>
+struct MakeTensor<GMMA::sparse_smem_desc<MajorMode>>
+{
+  // Note that this is the exact same as cute::GMMA::smem_desc above, plus additional static checks.
+  template <class TEngine, class TLayout>
+  CUTE_HOST_DEVICE constexpr auto
+  operator()(Tensor<TEngine,TLayout> const& smem_tensor)
+  {
+    static_assert(is_smem<TEngine>::value, "Expected SMEM Tensor to construct a GMMA Desc Tensor");
+    static_assert(is_sparse<typename TEngine::value_type>::value, "Expected sparse value_type.");
+    static_assert(is_sparse_ptr<TEngine>::value, "Expected sparse iter.");
+    return make_tensor(SM90::GMMA::DescriptorIterator{SM90::GMMA::make_gmma_desc<MajorMode>(tensor<0>(smem_tensor))},
+                       replace<0>(recast<uint128_t const>(smem_tensor).layout(), Layout<_1,_0>{}));
+  }
+};
+
+///////////////////////////////////////////////////////////////////////////////
+//////////////////////////// MMA_TRAITS ///////////////////////////////////////
+///////////////////////////////////////////////////////////////////////////////
+
+namespace SM90::GMMA {
+
+// Metadata layouts
+using ELayout_64x64 = Layout, Shape <_32>>,
+                             Stride, Stride<_64>>>;
+
+using ELayout_64x32 = Layout, Shape <_16,_2>>,
+                             Stride, Stride<_64,_8>>>;
+
+using ELayout_64x16 = Layout, Shape < _8,_2>>,
+                             Stride, Stride<_64,_8>>>;
+
+} // namespace SM90::GMMA
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+namespace SM90::GMMA::SPARSE {
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <class MMAOp, class... MMAOpArgs,
+          class TD, class DLayout,
+          class TA, class ALayout,
+          class TB, class BLayout,
+          class TC, class CLayout>
+CUTE_HOST_DEVICE constexpr void
+mma_unpack(MMA_Traits<MMAOp, MMAOpArgs...> const& traits,
+           Tensor<TD, DLayout>      & D,
+           Tensor<TA, ALayout> const& A_zipped,
+           Tensor<TB, BLayout> const& B,
+           Tensor<TC, CLayout> const& C)
+{
+  static_assert(is_rmem_v<TD>, "Expected registers in MMA_Atom::call");
+  static_assert(is_rmem_v<TA>, "Expected registers in MMA_Atom::call");
+  static_assert(is_rmem_v<TB>, "Expected registers in MMA_Atom::call");
+  static_assert(is_rmem_v<TC>, "Expected registers in MMA_Atom::call");
+
+  using DRegisters = typename MMAOp::DRegisters;
+  using ARegisters = typename MMAOp::ARegisters;
+  using ERegisters = typename MMAOp::ERegisters;
+  using BRegisters = typename MMAOp::BRegisters;
+  using CRegisters = typename MMAOp::CRegisters;
+
+  // Register value types from the MMAOp register arrays
+  using RegTypeD = typename remove_extent<DRegisters>::type;
+  using RegTypeA = typename remove_extent<ARegisters>::type;
+  using RegTypeE = typename remove_extent<ERegisters>::type;
+  using RegTypeB = typename remove_extent<BRegisters>::type;
+  using RegTypeC = typename remove_extent<CRegisters>::type;
+
+  constexpr int RegNumA = extent<ARegisters>::value;
+  constexpr int RegNumE = extent<ERegisters>::value;
+  constexpr int RegNumB = extent<BRegisters>::value;
+  constexpr int RegNumC = extent<CRegisters>::value;
+
+  auto [A, E] = unzip_tensor(A_zipped);
+  Tensor rA = recast<RegTypeA>(A);
+  Tensor rE = recast<RegTypeE>(E);
+  Tensor rB = recast<RegTypeB>(B);
+
+  CUTE_STATIC_ASSERT_V(size(rA) == Int<RegNumA>{});
+  CUTE_STATIC_ASSERT_V(size(rE) == Int<RegNumE>{});
+  CUTE_STATIC_ASSERT_V(size(rB) == Int<RegNumB>{});
+
+  static_assert(is_same<RegTypeD, void>::value, "GMMA DRegisters must have void type.");
+  static_assert(is_same<typename TD::value_type, typename TC::value_type>::value, "GMMA C and D value_type must match.");
+  static_assert(is_same<DLayout, CLayout>::value, "GMMA C and D layouts must match.");
+
+  Tensor rC = recast<RegTypeC>(D);  // NOTE: D and C are same, so use mutable D
+
+  CUTE_STATIC_ASSERT_V(size(rC) == Int<RegNumC>{});
+
+  detail::explode(MMAOp::fma,
+                  rA, make_int_sequence<RegNumA>{},
+                  rB, make_int_sequence<RegNumB>{},
+                  rC, make_int_sequence<RegNumC>{},
+                  rE, make_int_sequence<RegNumE>{},
+                  &(traits.accumulate_), seq<0>{});
+}
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+} // namespace SM90::GMMA::SPARSE
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
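`mma_unpack` splits the zipped A operand into value and metadata tensors with `unzip_tensor`, recasts each to the op's register types, and then `detail::explode` expands the flat register tensors into the individual scalar arguments of `MMAOp::fma`. A toy, self-contained stand-in for that expansion pattern (an assumption for illustration, not the CUTLASS implementation):

```cpp
#include <utility>

// Expand a statically sized register array into N scalar arguments of an
// fma-like callable via an index sequence.
template <class Fn, class Array, int... Is>
void explode_like(Fn&& fn, Array const& regs, std::integer_sequence<int, Is...>) {
  fn(regs[Is]...);  // expands to fn(regs[0], regs[1], ..., regs[N-1])
}
// e.g. explode_like(op, r, std::make_integer_sequence<int, 4>{});
```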
sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_8,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<  8, 32>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x32_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_8,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<  8, 32>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x32_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_16,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x32_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_16,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 16, 32>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x32_F16F16F16_SS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<tnspA>;
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_32,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 32, 32>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::Major tnspA, GMMA::Major tnspB, GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x32_F16F16F16_RS<tnspA, tnspB, scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<tnspB>;
+
+  using Shape_MNK = Shape<_64,_32,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout =
GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
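+
+// -------------------------------------------------------------------------------------------------
+// Usage sketch (illustrative only; the tensor names below are hypothetical and the op spelling
+// follows the reconstruction above, not a definitive API reference). A TiledMMA built from one of
+// these sparse atoms consumes an A fragment that zips sparse_elem<2, half_t> values with
+// sparse_elem<8, uint8_t> metadata; the mma_unpack() overload defined earlier unzips it and
+// forwards value and metadata registers to the underlying op's fma.
+//
+//   using SpOp = SM90::GMMA::SPARSE::GMMA_64x64x32_F16F16F16_SS<GMMA::Major::K, GMMA::Major::K,
+//                                                               GMMA::ScaleIn::One, GMMA::ScaleIn::One>;
+//   auto tiled_mma = make_tiled_mma(SpOp{});
+//   // ... partition tCrA_zipped / tCrB / tCrC with tiled_mma as usual, then:
+//   gemm(tiled_mma, tCrA_zipped, tCrB, tCrC);  // dispatches to SPARSE::mma_unpack via ADL
+// -------------------------------------------------------------------------------------------------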
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using 
ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = 
GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = 
sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_32>; + using ThrID = Layout<_128>; + using ALayout = 
GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 8, 32>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 16, 32>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 32, 32>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 64, 32>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 96, 32>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE 
= sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<128, 32>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<192, 32>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<256, 32>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_16>; + using ThrID = Layout<_128>; + using ALayout = 
GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 8, 16>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 8, 16>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 16, 16>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 16, 16>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 32, 16>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 32, 16>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
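+
+// -------------------------------------------------------------------------------------------------
+// Note on the metadata (E) operand: its layout and element type track the instruction's logical K
+// extent. In the traits above and below, 16-bit inputs (F16/BF16, K = 32) use ELayout_64x32 with
+// sparse_elem<8, uint8_t> metadata, TF32 inputs (K = 16) use ELayout_64x16 with
+// sparse_elem<4, uint8_t>, and the 8-bit integer instructions (K = 64) use ELayout_64x64 with
+// sparse_elem<8, uint8_t>.
+// -------------------------------------------------------------------------------------------------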
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 64, 16>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 64, 16>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 96, 16>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 96, 16>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<128, 16>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE 
= sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<128, 16>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<192, 16>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<192, 16>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<256, 16>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<256, 16>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 
64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = 
GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + 
using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = 
int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
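+
+// -------------------------------------------------------------------------------------------------
+// The *_RS integer traits in this stretch differ from their *_SS counterparts above only in where A
+// comes from: FrgTypeA is dropped (A is consumed from registers rather than through a shared-memory
+// descriptor) and ALayout switches from the descriptor layout GMMA::ABLayout<64, 64> to the register
+// fragment layout GMMA::ALayout_64x64. B remains a GMMA::smem_desc in both flavors.
+// -------------------------------------------------------------------------------------------------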
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
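[Editor's sketch, not part of this diff.] These traits are consumed through CuTe's MMA_Atom/TiledMMA machinery like their dense counterparts. A minimal sketch of tiling one of the sparse atoms above, assuming the reconstructed SM90::GMMA::SPARSE operation names and CUTLASS >= 3.6:

    // Sketch only: assumes the sparse GMMA atoms declared in this hunk.
    #include <cute/tensor.hpp>

    using namespace cute;

    // One warpgroup (128 threads) owns a full 64x64x64 sparse MMA; a 2x1x1
    // atom layout stacks two warpgroups for a 128x64 CTA tile.
    using Atom     = MMA_Atom<SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8S8_RS_TN>;
    using TiledMma = decltype(make_tiled_mma(Atom{}, Layout<Shape<_2,_1,_1>>{}));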
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
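[Editor's sketch, not part of this diff.] A note on the value-type aliases repeated above: `sparse_elem<2, int8_t>` records that each stored A value stands for a two-element logical group (2:4 structured sparsity halves the stored K extent), and `sparse_elem<8, uint8_t>` records that one metadata byte of E covers eight logical elements. Plain constexpr arithmetic mirroring that bookkeeping for the K=64 atoms in this file:

    // Sketch only: the ratios come straight from the aliases above.
    constexpr int K          = 64;  // logical K extent of each sparse atom
    constexpr int sparsity_A = 2;   // from sparse_elem<2, int8_t>
    constexpr int sparsity_E = 8;   // from sparse_elem<8, uint8_t>

    constexpr int stored_A_per_row = K / sparsity_A;  // 32 values actually stored
    constexpr int metadata_bytes   = K / sparsity_E;  // 8 metadata bytes per row
    static_assert(stored_A_per_row == 32 && metadata_bytes == 8, "2:4 bookkeeping");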
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
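[Editor's sketch, not part of this diff.] The `_SS_` and `_RS_` flavors interleaved in this hunk differ only in where A lives: `_SS_` traits carry both `FrgTypeA` and `FrgTypeB` shared-memory descriptors, while `_RS_` traits drop `FrgTypeA` and switch `ALayout` to `GMMA::ALayout_64x64`, i.e. A is fed from registers. A hedged compile-time check (the trait names are the reconstructions used above; the detection idiom is standard C++17):

    // Sketch only: verifies the SS/RS distinction described above.
    #include <type_traits>

    template <class T, class = void>
    struct has_FrgTypeA : std::false_type {};
    template <class T>
    struct has_FrgTypeA<T, std::void_t<typename T::FrgTypeA>> : std::true_type {};

    using SS = cute::MMA_Traits<cute::SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_SS_TN>;
    using RS = cute::MMA_Traits<cute::SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8S8_RS_TN>;

    static_assert(has_FrgTypeA<SS>::value,  "SS: A is read through an smem descriptor");
    static_assert(!has_FrgTypeA<RS>::value, "RS: A comes from registers");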
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x16x64_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x32x64_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_32,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 32, 64>;
+  using CLayout = GMMA::CLayout_64x32;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x64x64_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_64,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 64, 64>;
+  using CLayout = GMMA::CLayout_64x64;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x96x64_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_96,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 96, 64>;
+  using CLayout = GMMA::CLayout_64x96;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x128x64_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_128,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<128, 64>;
+  using CLayout = GMMA::CLayout_64x128;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x192x64_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_192,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<192, 64>;
+  using CLayout = GMMA::CLayout_64x192;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x256x64_S32U8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_256,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<256, 64>;
+  using CLayout = GMMA::CLayout_64x256;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E4M3_SS_TN>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 8, 64>;
+  using CLayout = GMMA::CLayout_64x8;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x8x64_F16E4M3E4M3_RS_TN>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_8,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using
BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = 
sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + using ThrID = Layout<_128>; + using 
ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = 
GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, 
float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; 
+ using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut 
accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_32,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 32, 64>; + using CLayout = GMMA::CLayout_64x32; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, 
float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_64,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 64, 64>; + using CLayout = GMMA::CLayout_64x64; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + 
using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_96,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 96, 64>; + using CLayout = GMMA::CLayout_64x96; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + 
GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_128,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<128, 64>; + using CLayout = GMMA::CLayout_64x128; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_192,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<192, 64>; + using CLayout = GMMA::CLayout_64x192; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using 
ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_256,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<256, 64>; + using CLayout = GMMA::CLayout_64x256; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using 
Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_8,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 8, 64>; + using CLayout = GMMA::CLayout_64x8; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = GMMA::CLayout_64x16; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_16,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 16, 64>; + using CLayout = 
GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template 
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_16,_64>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 16, 64>;
+  using CLayout = GMMA::CLayout_64x16;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
[The remaining specializations in this file follow the identical pattern: E5M2×E4M3 pairs for N ∈ {32, 64, 96, 128, 192, 256} and E5M2×E5M2 pairs for N ∈ {8, 16, 32, 64, 96, 128, 192, 256}, each with Shape_MNK = 64×N×64, in SS and RS operand flavors, with half_t and float accumulators.]
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+} // end namespace cute
+
+#if defined(CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
+#include "mma_traits_sm90_gmma_sparse_ext.hpp"
+#endif
\ No newline at end of file
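The new header introduced below supplies the extended, non-power-of-two N tile shapes; the guarded include above makes it strictly opt-in. A minimal sketch of how a translation unit enables it (the macro can equally be passed on the compile line as `-DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED`):

```cpp
// Opt in to the extended sparse GMMA tile shapes before including the
// CuTe MMA trait headers; without the macro, only the standard shapes
// from the base header above are available.
#define CUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED
#include <cute/atom/mma_traits_sm90_gmma_sparse.hpp>
```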
diff --git a/include/cute/atom/mma_traits_sm90_gmma_sparse_ext.hpp b/include/cute/atom/mma_traits_sm90_gmma_sparse_ext.hpp
new file mode 100644
index 0000000000..3680b7e13f
--- /dev/null
+++ b/include/cute/atom/mma_traits_sm90_gmma_sparse_ext.hpp
@@ -0,0 +1,17335 @@
+/***************************************************************************************************
+ * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * [Standard BSD-3-Clause license text: redistribution conditions and warranty disclaimer.]
+ *
+ **************************************************************************************************/
+
+#pragma once
+
+#include 
+#include 
+
+namespace cute {
+
+template 
+struct MMA_Traits>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
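For orientation, a small introspection sketch (not part of the patch) showing what each of these trait bundles publishes. `Op` stands for any sparse GMMA op tag specialized in this header. The wrapper types read as follows: `sparse_elem<2, half_t>` declares two logical A values per stored element (2:4 structured sparsity), and `sparse_elem<8, uint8_t>` declares eight logical metadata elements per stored byte, which is why the K extent here is 32 while a dense FP16 GMMA tile has K = 16.

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

// Print the static tile shape and thread layout that a sparse GMMA
// MMA_Traits specialization advertises. `Op` is any op tag specialized
// in this header (assumption: the corresponding traits are visible).
template <class Op>
void print_sparse_mma_traits() {
  using Traits = cute::MMA_Traits<Op>;
  cute::print(typename Traits::Shape_MNK{});  // e.g. (_64,_24,_32)
  cute::print("\n");
  cute::print(typename Traits::ThrID{});      // one warpgroup of 128 threads
  cute::print("\n");
}
```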
ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 40, 32>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 48, 32>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 56, 32>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = 
Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 72, 32>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 80, 32>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout< 88, 32>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<104, 32>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<112, 32>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<120, 32>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, 
uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<120, 32>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<136, 32>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<144, 32>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = 
GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<152, 32>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<160, 32>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<168, 32>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct 
MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<176, 32>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<184, 32>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = 
Shape<_64,_200,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<200, 32>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, half_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = half_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = 
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_208,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<208, 32>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_216,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<216, 32>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_224,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<224, 32>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_232,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<232, 32>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_240,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<240, 32>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, half_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = half_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_248,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<248, 32>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_24,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 24, 32>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_40,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 40, 32>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_48,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 48, 32>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_56,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 56, 32>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_72,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 72, 32>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_80,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 80, 32>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_88,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout< 88, 32>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_104,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<104, 32>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_112,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<112, 32>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_120,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<120, 32>;
+  using CLayout = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_136,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<136, 32>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_144,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<144, 32>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_152,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<152, 32>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_160,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<160, 32>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_168,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<168, 32>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_176,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<176, 32>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_184,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<184, 32>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc;
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 32>;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = bfloat16_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc;
+
+  using Shape_MNK = Shape<_64,_200,_32>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x32;
+  using ELayout = GMMA::ELayout_64x32;
+  using BLayout = GMMA::ABLayout<200, 32>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template
+struct MMA_Traits>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, bfloat16_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<208, 32>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<216, 32>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using 
ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<224, 32>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<232, 32>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<240, 32>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 32>; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<248, 32>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, bfloat16_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = bfloat16_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_32>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x32; + using ELayout = GMMA::ELayout_64x32; + using BLayout = GMMA::ABLayout<248, 32>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 24, 16>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 24, 16>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 40, 16>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 40, 16>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using 
ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 48, 16>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 48, 16>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 56, 16>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 56, 16>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 72, 16>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = 
GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 72, 16>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 80, 16>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 80, 16>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 88, 16>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout< 88, 16>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<104, 16>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<104, 16>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<112, 16>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<112, 16>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<120, 16>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<120, 16>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + 
using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<136, 16>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<136, 16>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<144, 16>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<144, 16>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<152, 16>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using 
ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<152, 16>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<160, 16>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<160, 16>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<168, 16>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<168, 16>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<176, 16>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<176, 16>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<184, 16>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<184, 16>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<200, 16>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<200, 16>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + 
using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<208, 16>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<208, 16>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<216, 16>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<216, 16>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<224, 16>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using 
ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<224, 16>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<232, 16>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<232, 16>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<240, 16>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<240, 16>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 16>; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<248, 16>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, tfloat32_t>; + using ValTypeE = sparse_elem<4, uint8_t>; + using ValTypeB = tfloat32_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_16>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x16; + using ELayout = GMMA::ELayout_64x16; + using BLayout = GMMA::ABLayout<248, 16>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 48, 64>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 48, 64>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, 
int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 80, 64>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 80, 64>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<144, 64>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB 
= GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<144, 64>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<160, 64>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<160, 64>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = 
GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; 
+}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 48, 64>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 48, 64>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 80, 64>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using 
Shape_MNK = Shape<_64,_80,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 80, 64>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<144, 64>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<144, 64>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, int8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<160, 64>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
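A note on the pattern above: the `SS` specializations publish both `FrgTypeA` and `FrgTypeB` as shared-memory descriptors, while the `RS` ones omit `FrgTypeA` because A is sourced from registers (hence `GMMA::ALayout_64x64` instead of `GMMA::ABLayout< 64, 64>`). A minimal sketch of a trait check keyed off that convention; `sources_A_from_smem` is a hypothetical helper for illustration, not part of this patch:

```cpp
#include <cute/atom/mma_traits.hpp>
#include <type_traits>

// Hypothetical helper (illustration only): an op sources A from shared memory
// exactly when its MMA_Traits specialization declares FrgTypeA, as the SS
// specializations above do and the RS ones do not.
template <class Op, class = void>
struct sources_A_from_smem : std::false_type {};

template <class Op>
struct sources_A_from_smem<Op, std::void_t<typename cute::MMA_Traits<Op>::FrgTypeA>>
    : std::true_type {};
```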
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_S32S8U8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, int8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
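The element encodings repeated in these specializations are worth decoding once. A short sketch of the arithmetic they imply, under the assumption that `sparse_elem<S, T>` denotes one stored `T` standing for `S` logical elements:

```cpp
// Sketch (not from the patch) of the bookkeeping behind the encodings above,
// assuming sparse_elem<S, T> means one stored T per S logical elements.
//
// ValTypeA = sparse_elem<2, int8_t>:  2:4 structured sparsity keeps 2 of every
// 4 logical int8 values, so one stored byte covers 2 logical elements.
// ValTypeE = sparse_elem<8, uint8_t>: one metadata byte covers 8 logical
// elements (4 bits select the 2 surviving positions per 4-element group).
constexpr int logical_k        = 64;            // K extent of these atoms
constexpr int stored_a_values  = logical_k / 2; // 32 int8 values actually stored
constexpr int metadata_e_bytes = logical_k / 8; // 8 metadata bytes per atom row
static_assert(stored_a_values == 32 && metadata_e_bytes == 8, "2:4 int8 bookkeeping");
```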
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8S8_SS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_S32U8S8_SS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_S32U8S8_RS_TN_SATURATE>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_S32U8S8_RS_TN>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = int8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
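These traits are pure compile-time descriptions of a warpgroup-wide atom. A minimal sketch of how such an atom is consumed downstream, assuming CuTe's public `make_tiled_mma` API; `SparseGmmaOp` is a placeholder for any op whose `MMA_Traits` specialization appears above, not a name from this patch:

```cpp
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

// Sketch (not part of this patch): composing a sparse GMMA atom into a
// CTA-wide TiledMMA. make_tiled_mma reads Shape_MNK, ThrID, and the A/B/C
// layouts out of MMA_Traits<SparseGmmaOp>; a 1x1x1 atom layout replicates a
// single warpgroup atom across the CTA tile.
template <class SparseGmmaOp>
auto make_one_atom_tiled_mma()
{
  using namespace cute;
  return make_tiled_mma(SparseGmmaOp{}, Layout<Shape<_1,_1,_1>>{});
}
```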
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, uint8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, uint8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, uint8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = int8_t; + using ValTypeC = int32_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, uint8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, uint8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = uint8_t; + using ValTypeC = int32_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = int32_t; + using ValTypeA = sparse_elem<2, uint8_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = 
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <...>
+struct MMA_Traits<...>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeA = GMMA::smem_desc<...>;
+  using FrgTypeB = GMMA::smem_desc<...>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
// ... the same S32 = U8 x U8 smem-sourced-A traits pattern repeats, two copies per
// shape in the source, for N = 24, 48, 80, 112, 144, 160, 176, 208, 224, and 240;
// only Shape_MNK, BLayout, and CLayout vary with N ...
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <...>
+struct MMA_Traits<...>
+{
+  using ValTypeD = int32_t;
+  using ValTypeA = sparse_elem<2, uint8_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = uint8_t;
+  using ValTypeC = int32_t;
+
+  using FrgTypeB = GMMA::smem_desc<...>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
// ... the same S32 = U8 x U8 register-sourced-A traits pattern repeats, two copies
// per shape in the source, for N = 24, 48, 80, 112, 144, 160, 176, 208, 224, and 240 ...
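// Aside (not part of the diff): a minimal sketch of how a traits specialization
// like the ones above is consumed through CuTe. "SparseGmmaOp" is a hypothetical
// placeholder for whichever SM90 sparse GMMA arch op a given MMA_Traits binds to;
// MMA_Atom and make_tiled_mma are the standard CuTe entry points.
#include <cute/tensor.hpp>
#include <cute/atom/mma_atom.hpp>

template <class SparseGmmaOp>
void inspect_sparse_mma()
{
  using namespace cute;
  using Traits = MMA_Traits<SparseGmmaOp>;
  // Each GMMA is executed cooperatively by one warpgroup (128 threads),
  // matching the ThrID = Layout<_128> in every specialization above.
  static_assert(size(typename Traits::ThrID{}) == 128, "GMMA expects a full warpgroup");
  // MMA_Atom inherits Shape_MNK and the A/E/B/C layouts from MMA_Traits;
  // make_tiled_mma lifts the atom into a tile-level TiledMMA.
  auto tiled_mma = make_tiled_mma(MMA_Atom<SparseGmmaOp>{});
  (void) tiled_mma;
}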
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <...>
+struct MMA_Traits<...>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<...>;
+  using FrgTypeB = GMMA::smem_desc<...>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <...>
+struct MMA_Traits<...>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<...>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <...>
+struct MMA_Traits<...>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<...>;
+  using FrgTypeB = GMMA::smem_desc<...>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <...>
+struct MMA_Traits<...>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<...>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID   = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
// ... the same F16/F32 = E4M3 x E4M3 quartet (half_t and float accumulators, each in
// smem-sourced-A and register-sourced-A form) repeats for N = 40, 48, and 56 ...
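// Aside (not part of the diff): within each quartet above, the specializations
// that declare both FrgTypeA and FrgTypeB as GMMA::smem_desc source both
// operands from shared-memory descriptors (the "SS" flavor of the underlying
// GMMA op, with ALayout = GMMA::ABLayout<64, 64>), while those declaring only
// FrgTypeB appear to keep the compressed A operand in registers (the "RS"
// flavor, with ALayout = GMMA::ALayout_64x64). The FP8 quartets additionally
// pair each form with half_t and float accumulators (ValTypeD/ValTypeC).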
// ... the quartet continues, unchanged in structure, for N = 72, 80, 88, 104, and 112;
// only Shape_MNK, BLayout, and CLayout vary with N ...
// ... and for N = 120, 136, 144, 152, 160, and 168 ...
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <...>
+struct MMA_Traits<...>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<...>;
+  using FrgTypeB
= GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<168, 64>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<168, 64>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = 
GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<184, 64>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<184, 64>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<184, 64>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<184, 64>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<200, 64>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + 
+template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<200, 64>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<200, 64>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<200, 64>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = 
GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<216, 64>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<216, 64>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<216, 64>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = 
GMMA::ABLayout<216, 64>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<232, 64>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<232, 64>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<232, 64>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<232, 64>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = 
sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<248, 64>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<248, 64>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<248, 64>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_64>; + using ThrID = Layout<_128>; 
+ using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<248, 64>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_24,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 24, 64>; + using CLayout = GMMA::CLayout_64x24; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 40, 64>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = 
GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 40, 64>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 40, 64>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_40,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 40, 64>; + using CLayout = GMMA::CLayout_64x40; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 48, 64>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 48, 64>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + 
using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 48, 64>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_48,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 48, 64>; + using CLayout = GMMA::CLayout_64x48; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 56, 64>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 56, 64>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 56, 64>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_56,_64>; + using ThrID = 
Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 56, 64>; + using CLayout = GMMA::CLayout_64x56; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 72, 64>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 72, 64>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 72, 64>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_72,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 72, 64>; + using CLayout = GMMA::CLayout_64x72; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 80, 64>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut 
accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 80, 64>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 80, 64>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_80,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 80, 64>; + using CLayout = GMMA::CLayout_64x80; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 88, 64>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 88, 64>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, 
float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 88, 64>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 88, 64>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<104, 64>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<104, 64>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<104, 64>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_64>; 
+ using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<104, 64>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e4m3_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<120, 64>; + using CLayout = GMMA::CLayout_64x120; + 
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x120x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<120, 64>;
+  using CLayout   = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<120, 64>;
+  using CLayout   = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x120x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_120,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<120, 64>;
+  using CLayout   = GMMA::CLayout_64x120;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<136, 64>;
+  using CLayout   = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x136x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<136, 64>;
+  using CLayout   = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<136, 64>;
+  using CLayout   = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x136x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<136, 64>;
+  using CLayout   = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<144, 64>;
+  using CLayout   = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<144, 64>;
+  using CLayout   = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<144, 64>;
+  using CLayout   = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<144, 64>;
+  using CLayout   = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<152, 64>;
+  using CLayout   = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x152x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<152, 64>;
+  using CLayout   = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<152, 64>;
+  using CLayout   = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x152x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<152, 64>;
+  using CLayout   = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<160, 64>;
+  using CLayout   = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<160, 64>;
+  using CLayout   = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<160, 64>;
+  using CLayout   = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<160, 64>;
+  using CLayout   = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<168, 64>;
+  using CLayout   = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x168x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<168, 64>;
+  using CLayout   = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<168, 64>;
+  using CLayout   = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x168x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<168, 64>;
+  using CLayout   = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<176, 64>;
+  using CLayout   = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<176, 64>;
+  using CLayout   = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<176, 64>;
+  using CLayout   = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<176, 64>;
+  using CLayout   = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<184, 64>;
+  using CLayout   = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x184x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<184, 64>;
+  using CLayout   = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<184, 64>;
+  using CLayout   = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x184x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<184, 64>;
+  using CLayout   = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<200, 64>;
+  using CLayout   = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x200x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<200, 64>;
+  using CLayout   = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<200, 64>;
+  using CLayout   = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x200x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<200, 64>;
+  using CLayout   = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<208, 64>;
+  using CLayout   = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<208, 64>;
+  using CLayout   = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<208, 64>;
+  using CLayout   = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<208, 64>;
+  using CLayout   = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<216, 64>;
+  using CLayout   = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x216x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<216, 64>;
+  using CLayout   = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<216, 64>;
+  using CLayout   = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x216x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<216, 64>;
+  using CLayout   = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<224, 64>;
+  using CLayout   = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<224, 64>;
+  using CLayout   = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<224, 64>;
+  using CLayout   = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<224, 64>;
+  using CLayout   = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<232, 64>;
+  using CLayout   = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x232x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<232, 64>;
+  using CLayout   = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<232, 64>;
+  using CLayout   = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x232x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<232, 64>;
+  using CLayout   = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<240, 64>;
+  using CLayout   = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<240, 64>;
+  using CLayout   = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<240, 64>;
+  using CLayout   = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<240, 64>;
+  using CLayout   = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<248, 64>;
+  using CLayout   = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x248x64_F16E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<248, 64>;
+  using CLayout   = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<248, 64>;
+  using CLayout   = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x248x64_F32E4M3E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e4m3_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout<248, 64>;
+  using CLayout   = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 24, 64>;
+  using CLayout   = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 24, 64>;
+  using CLayout   = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 24, 64>;
+  using CLayout   = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 24, 64>;
+  using CLayout   = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 40, 64>;
+  using CLayout   = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 40, 64>;
+  using CLayout   = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 40, 64>;
+  using CLayout   = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 40, 64>;
+  using CLayout   = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 48, 64>;
+  using CLayout   = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 48, 64>;
+  using CLayout   = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 48, 64>;
+  using CLayout   = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 48, 64>;
+  using CLayout   = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 56, 64>;
+  using CLayout   = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 56, 64>;
+  using CLayout   = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 56, 64>;
+  using CLayout   = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 56, 64>;
+  using CLayout   = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 72, 64>;
+  using CLayout   = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 72, 64>;
+  using CLayout   = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 72, 64>;
+  using CLayout   = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 72, 64>;
+  using CLayout   = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 80, 64>;
+  using CLayout   = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 80, 64>;
+  using CLayout   = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 80, 64>;
+  using CLayout   = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 80, 64>;
+  using CLayout   = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ABLayout< 64, 64>;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 88, 64>;
+  using CLayout   = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_64>;
+  using ThrID     = Layout<_128>;
+  using ALayout   = GMMA::ALayout_64x64;
+  using ELayout   = GMMA::ELayout_64x64;
+  using BLayout   = GMMA::ABLayout< 88, 64>;
+  using CLayout   = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 88, 64>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_88,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout< 88, 64>; + using CLayout = GMMA::CLayout_64x88; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<104, 64>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<104, 64>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<104, 64>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = 
float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_104,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<104, 64>; + using CLayout = GMMA::CLayout_64x104; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_112,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<112, 64>; + using CLayout = GMMA::CLayout_64x112; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; 
+ + using Shape_MNK = Shape<_64,_120,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<120, 64>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<120, 64>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<120, 64>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<120, 64>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<136, 64>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e4m3_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<136, 64>; + using CLayout = 
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<136, 64>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x136x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_136,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<136, 64>;
+  using CLayout = GMMA::CLayout_64x136;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x144x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_144,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<144, 64>;
+  using CLayout = GMMA::CLayout_64x144;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<152, 64>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x152x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<152, 64>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<152, 64>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x152x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_152,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<152, 64>;
+  using CLayout = GMMA::CLayout_64x152;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x160x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_160,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<160, 64>;
+  using CLayout = GMMA::CLayout_64x160;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<168, 64>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x168x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<168, 64>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<168, 64>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x168x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_168,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<168, 64>;
+  using CLayout = GMMA::CLayout_64x168;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
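+// All of these sparse-GMMA traits follow one recipe: ValTypeA = sparse_elem<2, T> marks A as
+// 2:4 structured-sparse (two stored values stand in for four logical K-elements), and
+// ValTypeE = sparse_elem<8, uint8_t> packs the sparsity metadata for eight logical elements
+// into one byte. The _SS_ specializations expose both FrgTypeA and FrgTypeB as shared-memory
+// descriptors, while the _RS_ ones feed A from registers and keep only FrgTypeB.
+//
+// Hedged usage sketch (op spelling and default ScaleIn arguments assumed from the matching
+// declarations in cute/arch/mma_sm90_gmma_sparse.hpp; not part of the upstream pattern above):
+//
+//   using Op = SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E4M3_SS_TN<>;   // sparse FP8 wgmma
+//   auto tiled_mma = cute::make_tiled_mma(cute::MMA_Atom<Op>{});         // resolves via MMA_Traits
+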
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x176x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_176,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<176, 64>;
+  using CLayout = GMMA::CLayout_64x176;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<184, 64>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x184x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<184, 64>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<184, 64>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x184x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_184,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<184, 64>;
+  using CLayout = GMMA::CLayout_64x184;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<200, 64>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x200x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<200, 64>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<200, 64>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x200x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_200,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<200, 64>;
+  using CLayout = GMMA::CLayout_64x200;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x208x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_208,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<208, 64>;
+  using CLayout = GMMA::CLayout_64x208;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<216, 64>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x216x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<216, 64>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<216, 64>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x216x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_216,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<216, 64>;
+  using CLayout = GMMA::CLayout_64x216;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x224x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_224,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<224, 64>;
+  using CLayout = GMMA::CLayout_64x224;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<232, 64>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x232x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<232, 64>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<232, 64>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x232x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_232,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<232, 64>;
+  using CLayout = GMMA::CLayout_64x232;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x240x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_240,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<240, 64>;
+  using CLayout = GMMA::CLayout_64x240;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<248, 64>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x248x64_F16E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<248, 64>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E4M3_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<248, 64>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x248x64_F32E5M2E4M3_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e4m3_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_248,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<248, 64>;
+  using CLayout = GMMA::CLayout_64x248;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x24x64_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_24,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 24, 64>;
+  using CLayout = GMMA::CLayout_64x24;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 40, 64>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x40x64_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 40, 64>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 40, 64>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x40x64_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_40,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 40, 64>;
+  using CLayout = GMMA::CLayout_64x40;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x48x64_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_48,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 48, 64>;
+  using CLayout = GMMA::CLayout_64x48;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 56, 64>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x56x64_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 56, 64>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 56, 64>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x56x64_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_56,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 56, 64>;
+  using CLayout = GMMA::CLayout_64x56;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 72, 64>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x72x64_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 72, 64>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 72, 64>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x72x64_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_72,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 72, 64>;
+  using CLayout = GMMA::CLayout_64x72;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x80x64_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_80,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 80, 64>;
+  using CLayout = GMMA::CLayout_64x80;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 88, 64>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x88x64_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 88, 64>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 88, 64>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x88x64_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_88,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout< 88, 64>;
+  using CLayout = GMMA::CLayout_64x88;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<104, 64>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x104x64_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<104, 64>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<104, 64>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x104x64_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_104,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<104, 64>;
+  using CLayout = GMMA::CLayout_64x104;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_F16E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = half_t;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = half_t;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeA = GMMA::smem_desc<GMMA::Major::K>;
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ABLayout< 64, 64>;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x112x64_F32E5M2E5M2_RS_TN<scaleA, scaleB>>
+{
+  using ValTypeD = float;
+  using ValTypeA = sparse_elem<2, float_e5m2_t>;
+  using ValTypeE = sparse_elem<8, uint8_t>;
+  using ValTypeB = float_e5m2_t;
+  using ValTypeC = float;
+
+  using FrgTypeB = GMMA::smem_desc<GMMA::Major::K>;
+
+  using Shape_MNK = Shape<_64,_112,_64>;
+  using ThrID = Layout<_128>;
+  using ALayout = GMMA::ALayout_64x64;
+  using ELayout = GMMA::ELayout_64x64;
+  using BLayout = GMMA::ABLayout<112, 64>;
+  using CLayout = GMMA::CLayout_64x112;
+
+  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
+};
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+template <GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
+struct MMA_Traits<SM90::GMMA::SPARSE::GMMA_64x120x64_F16E5M2E5M2_SS_TN<scaleA, scaleB>>
+{
using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<120, 64>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<120, 64>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<120, 64>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_120,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<120, 64>; + using CLayout = GMMA::CLayout_64x120; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<136, 64>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using 
FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<136, 64>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<136, 64>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_136,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<136, 64>; + using CLayout = GMMA::CLayout_64x136; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<144, 64>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<144, 64>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout 
= GMMA::ABLayout<144, 64>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_144,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<144, 64>; + using CLayout = GMMA::CLayout_64x144; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<152, 64>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<152, 64>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<152, 64>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_152,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<152, 64>; + using CLayout = GMMA::CLayout_64x152; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + 
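The operation names and template parameter lists in the specializations above are elided. As a sketch of the recurring pattern, one SS ("both operands in shared memory") specialization plausibly reads as follows; the op name and the `tnspA`/`tnspB` descriptor parameters are assumptions modeled on CuTe's dense SM90 GMMA traits, not confirmed by this diff:

```cpp
// Hedged sketch of one SS-variant specialization (N = 152).
// The op name below is hypothetical; only the member typedefs are verbatim.
template <GMMA::Major tnspA, GMMA::Major tnspB,
          GMMA::ScaleIn scaleA, GMMA::ScaleIn scaleB>
struct MMA_Traits<SM90_64x152x64_F32E5M2E5M2_SPARSE_SS_TN<  // assumed name
                    tnspA, tnspB, scaleA, scaleB>>
{
  using ValTypeD = float;
  using ValTypeA = sparse_elem<2, float_e5m2_t>;  // 2:4 structured-sparse A
  using ValTypeE = sparse_elem<8, uint8_t>;       // metadata: 8 logical K elems/byte
  using ValTypeB = float_e5m2_t;
  using ValTypeC = float;

  using FrgTypeA = GMMA::smem_desc<tnspA>;        // SS: A is an SMEM descriptor...
  using FrgTypeB = GMMA::smem_desc<tnspB>;        // ...and so is B

  using Shape_MNK = Shape<_64,_152,_64>;
  using ThrID     = Layout<_128>;                 // one 128-thread warpgroup
  using ALayout   = GMMA::ABLayout< 64, 64>;
  using ELayout   = GMMA::ELayout_64x64;
  using BLayout   = GMMA::ABLayout<152, 64>;
  using CLayout   = GMMA::CLayout_64x152;

  GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One;
};
```

The RS variants that follow each SS variant differ only in dropping `FrgTypeA` (A comes from registers) and switching `ALayout` from `GMMA::ABLayout<64, 64>` to `GMMA::ALayout_64x64`.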
+template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<160, 64>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<160, 64>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<160, 64>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_160,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<160, 64>; + using CLayout = GMMA::CLayout_64x160; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<168, 64>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using 
ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<168, 64>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<168, 64>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_168,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<168, 64>; + using CLayout = GMMA::CLayout_64x168; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = 
GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_176,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<176, 64>; + using CLayout = GMMA::CLayout_64x176; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<184, 64>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<184, 64>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<184, 64>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_184,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<184, 64>; + using CLayout = GMMA::CLayout_64x184; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + 
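For context on how these traits are consumed, a hedged sketch follows; `SparseGmmaOp` is a stand-in for one of the elided op types above:

```cpp
// Sketch: MMA_Traits<Op> supplies Shape_MNK, ThrID, and the A/B/C/E layouts,
// which is all make_tiled_mma needs to build a warpgroup-wide TiledMMA.
#include <cute/tensor.hpp>
using namespace cute;

template <class SparseGmmaOp>
void tiled_mma_sketch()
{
  auto tiled_mma = make_tiled_mma(MMA_Atom<SparseGmmaOp>{});
  // ValTypeA = sparse_elem<2, float_e5m2_t>: A keeps 2 of every 4 K-values,
  // so a logical 64x64 FP8 A tile stores 64x32 values plus metadata.
  // ValTypeE = sparse_elem<8, uint8_t>: one metadata byte covers eight
  // logical K elements (two 2:4 groups, 2 bits per kept value).
  (void)tiled_mma;
}
```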
+//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<200, 64>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<200, 64>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<200, 64>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_200,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<200, 64>; + using CLayout = GMMA::CLayout_64x200; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = 
sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_208,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<208, 64>; + using CLayout = GMMA::CLayout_64x208; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<216, 64>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<216, 64>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = 
Shape<_64,_216,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<216, 64>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_216,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<216, 64>; + using CLayout = GMMA::CLayout_64x216; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_224,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<224, 64>; + using CLayout = GMMA::CLayout_64x224; + + GMMA::ScaleOut 
accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<232, 64>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<232, 64>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<232, 64>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_232,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<232, 64>; + using CLayout = GMMA::CLayout_64x232; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using 
ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_240,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<240, 64>; + using CLayout = GMMA::CLayout_64x240; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<248, 64>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = half_t; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = half_t; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<248, 64>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeA = GMMA::smem_desc; + using FrgTypeB = 
GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ABLayout< 64, 64>; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<248, 64>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +template +struct MMA_Traits> +{ + using ValTypeD = float; + using ValTypeA = sparse_elem<2, float_e5m2_t>; + using ValTypeE = sparse_elem<8, uint8_t>; + using ValTypeB = float_e5m2_t; + using ValTypeC = float; + + using FrgTypeB = GMMA::smem_desc; + + using Shape_MNK = Shape<_64,_248,_64>; + using ThrID = Layout<_128>; + using ALayout = GMMA::ALayout_64x64; + using ELayout = GMMA::ELayout_64x64; + using BLayout = GMMA::ABLayout<248, 64>; + using CLayout = GMMA::CLayout_64x248; + + GMMA::ScaleOut accumulate_ = GMMA::ScaleOut::One; +}; + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +} // end namespace cute diff --git a/include/cute/config.hpp b/include/cute/config.hpp index e4c7db5ca6..6b79bedf39 100644 --- a/include/cute/config.hpp +++ b/include/cute/config.hpp @@ -148,21 +148,8 @@ # include #endif -// -// Support -// - -#include - -// -// Basic types -// - -#include - // // Debugging utilities // -#include #include diff --git a/include/cute/container/alignment.hpp b/include/cute/container/alignment.hpp index 4cf60d899f..52e4cbadd9 100644 --- a/include/cute/container/alignment.hpp +++ b/include/cute/container/alignment.hpp @@ -54,17 +54,17 @@ is_byte_aligned(void const* const ptr) # define CUTE_ALIGNAS(n) alignas(n) #endif -template +template struct aligned_struct {}; -template <> struct CUTE_ALIGNAS( 1) aligned_struct< 1> {}; -template <> struct CUTE_ALIGNAS( 2) aligned_struct< 2> {}; -template <> struct CUTE_ALIGNAS( 4) aligned_struct< 4> {}; -template <> struct CUTE_ALIGNAS( 8) aligned_struct< 8> {}; -template <> struct CUTE_ALIGNAS( 16) aligned_struct< 16> {}; -template <> struct CUTE_ALIGNAS( 32) aligned_struct< 32> {}; -template <> struct CUTE_ALIGNAS( 64) aligned_struct< 64> {}; -template <> struct CUTE_ALIGNAS(128) aligned_struct<128> {}; -template <> struct CUTE_ALIGNAS(256) aligned_struct<256> {}; +template struct CUTE_ALIGNAS( 1) aligned_struct< 1, Child> {}; +template struct CUTE_ALIGNAS( 2) aligned_struct< 2, Child> {}; +template struct CUTE_ALIGNAS( 4) aligned_struct< 4, Child> {}; +template struct CUTE_ALIGNAS( 8) aligned_struct< 8, Child> {}; +template struct CUTE_ALIGNAS( 16) aligned_struct< 16, Child> {}; +template struct CUTE_ALIGNAS( 32) aligned_struct< 32, Child> {}; +template struct CUTE_ALIGNAS( 64) aligned_struct< 64, Child> {}; +template struct CUTE_ALIGNAS(128) aligned_struct<128, Child> {}; +template struct CUTE_ALIGNAS(256) aligned_struct<256, Child> {}; } // end namespace cute diff --git a/include/cute/container/array_aligned.hpp b/include/cute/container/array_aligned.hpp index 9895a8da77..a9d14a1a25 100644 --- a/include/cute/container/array_aligned.hpp +++ b/include/cute/container/array_aligned.hpp @@ -30,8 +30,8 @@ **************************************************************************************************/ #pragma once -#include -#include +#include // CUTE_ALIGNAS +#include // cute::array namespace cute { diff --git a/include/cute/container/array_subbyte.hpp b/include/cute/container/array_subbyte.hpp index 1963d8ce7b..57db56aba5 100644 --- a/include/cute/container/array_subbyte.hpp 
+++ b/include/cute/container/array_subbyte.hpp
@@ -176,11 +176,26 @@ struct subbyte_reference
   }

   // Address
+  CUTE_HOST_DEVICE
   subbyte_iterator operator&() const { return {ptr_, idx_}; }
 };

+template
+CUTE_HOST_DEVICE
+void
+print(subbyte_reference ref) {
+  cute::print(ref.get());
+}
+
+template
+CUTE_HOST_DEVICE
+void
+pretty_print(subbyte_reference ref) {
+  cute::pretty_print(ref.get());
+}
+
 //
 // subbyte_iterator
 // Random-access iterator over subbyte references
@@ -332,6 +347,11 @@ print(subbyte_iterator const& x) {
   printf("subptr[%db](%p.%u)", int(sizeof_bits_v), x.ptr_, x.idx_);
 }

+template
+CUTE_HOST_DEVICE void
+print(subbyte_reference const& x) {
+  print(x.get());
+}

 //
 // array_subbyte
 // Statically sized array for non-byte-aligned data types

diff --git a/include/cute/container/bit_field.hpp b/include/cute/container/bit_field.hpp
index c5748d84c3..d7fac42a54 100644
--- a/include/cute/container/bit_field.hpp
+++ b/include/cute/container/bit_field.hpp
@@ -35,9 +35,9 @@
 #pragma once

-#include
-
+#include   // CUTE_HOST_DEVICE
 #include   // uint_bit_t
+#include   // cute::is_same

 namespace cute
 {

diff --git a/include/cute/container/cuda_types.hpp b/include/cute/container/cuda_types.hpp
index 8034cb271d..fbc314e543 100644
--- a/include/cute/container/cuda_types.hpp
+++ b/include/cute/container/cuda_types.hpp
@@ -30,12 +30,8 @@
 #pragma once

-#include
-
-#include
-
-#include
-#include
+#include   // CUTE_HOST_DEVICE, CUTE_GCC_UNREACHABLE
+#include   // cute::integral_constant

 namespace cute
 {

diff --git a/include/cute/container/tuple.hpp b/include/cute/container/tuple.hpp
index e8172299fa..42d9da9c92 100644
--- a/include/cute/container/tuple.hpp
+++ b/include/cute/container/tuple.hpp
@@ -636,14 +636,23 @@ template
 CUTE_HOST_DEVICE void
 print_tuple(Tuple const& t, index_sequence, char s = '(', char e = ')')
 {
   using cute::print;
-  print(s); ((void(print(Is == 0 ? '\0' : ',')), void(print(get(t)))), ...); print(e);
+  if (sizeof...(Is) == 0) {
+    print(s);
+  } else {
+    ((void(print(Is == 0 ? s : ',')), void(print(get(t)))), ...);
+  }
+  print(e);
 }

 #if !defined(__CUDACC_RTC__)
 template
 CUTE_HOST std::ostream&
 print_tuple_os(std::ostream& os, Tuple const& t, index_sequence, char s = '(', char e = ')')
 {
-  os << s; (void(os << (Is == 0 ? '\0' : ',') << get(t)), ...);
+  if (sizeof...(Is) == 0) {
+    os << s;
+  } else {
+    (void(os << (Is == 0 ? s : ',') << get(t)), ...);
+  }
   return os << e;
 }
 #endif // !defined(__CUDACC_RTC__)

diff --git a/include/cute/container/type_list.hpp b/include/cute/container/type_list.hpp
index 2db934356b..a15f2c1c15 100644
--- a/include/cute/container/type_list.hpp
+++ b/include/cute/container/type_list.hpp
@@ -30,8 +30,7 @@
 #pragma once

-#include
-#include
+#include   // CUTE_HOST_DEVICE, CUTE_STL_NAMESPACE

 namespace cute
 {

diff --git a/include/cute/int_tuple.hpp b/include/cute/int_tuple.hpp
index ceafba0d80..132e103830 100644
--- a/include/cute/int_tuple.hpp
+++ b/include/cute/int_tuple.hpp
@@ -30,21 +30,17 @@
 #pragma once

-#include
-
-#include
-#include
-#include
-#include
+#include   // CUTE_HOST_DEVICE
+#include   // cute::array
+#include   // cute::is_tuple
+#include   // cute::Int

 /** IntTuple is an integer or a tuple of IntTuples.
 *  This file holds utilities for working with IntTuples,
 *  but does not hold a concrete concept or class of IntTuple.
 */

-namespace cute
-{
-
+namespace cute {

 // Implementation of get<0>(Integral).
 // Even though is_tuple is false and tuple_size doesn't compile,
 // CuTe defines rank(Integral) as 1, so it's useful for get<0>(Integral) to return its input
@@ -66,6 +62,12 @@ get(T&& t) noexcept
   return get(get(static_cast(t)));
 }

+}
+
+#include   // cute::transform
+
+namespace cute {
+
 //
 // rank
 //
@@ -92,7 +94,7 @@ template
 using rank_t = decltype(rank(declval()));

 template
-static constexpr int rank_v = rank_t::value;
+static constexpr auto rank_v = rank_t::value;

 //
 // shape
@@ -212,7 +214,7 @@ template
 using depth_t = decltype(depth(declval()));

 template
-static constexpr int depth_v = depth_t::value;
+static constexpr auto depth_v = depth_t::value;

 //
 // product
@@ -276,7 +278,7 @@ size(IntTuple const& a)
 }

 template
-static constexpr int size_v = decltype(size(declval()))::value;
+static constexpr auto size_v = decltype(size(declval()))::value;

 //
 // sum
@@ -522,68 +524,31 @@ compatible(IntTupleA const& a, IntTupleB const& b)
 template
 using is_compatible = decltype(compatible(declval(), declval()));

-/** Test if Shape A is weakly compatible with Shape B:
- *  there exists a Shape C congruent to A such that compatible(elem_scale(A,C), B)
- *  Equivalently, the size of Shape B is a multiple of Shape A at each terminal of Shape A.
- *  weakly_compatible is a partial order on A and B: A <= B
- */
-template
-CUTE_HOST_DEVICE constexpr
-auto
-weakly_compatible(IntTupleA const& a, IntTupleB const& b)
-{
-  if constexpr (is_tuple::value && is_tuple::value) {
-    if constexpr (tuple_size::value != tuple_size::value) {
-      return false_type{};
-    } else {
-      return transform_apply(a, b, [](auto const& x, auto const& y) { return weakly_compatible(x,y); },
-                                   [](auto const&... z) { return (true_type{} && ... && z); });
-    }
-  } else if constexpr (is_integral::value) {
-    return size(b) % a == Int<0>{};
-  } else if constexpr (is_integral::value) {
-    return false_type{};
-  } else {
-    return weakly_compatible(shape(a), shape(b));
-  }
-
-  CUTE_GCC_UNREACHABLE;
-}
-
-template
-using is_weakly_compatible = decltype(weakly_compatible(declval(), declval()));
-
-/** Test if Shape A is softly compatible with Shape B:
- *  there exists a Shape C congruent to A such that compatible(shape_div(A,C), B)
- *  Equivalently, the size of Shape B divides Shape A at each terminal of Shape A.
- *  softly_compatible is a partial order on A and B: A <= B
+/** Test if Shape A is evenly divided by Tiler B
+ * @returns Static or dynamic boolean
+ * @post if result is true_type, then
+ *       size(a) == logical_divide(make_layout(shape(a)),b) will always compile
+ *       and result in true_type.
 */
-template
+template
 CUTE_HOST_DEVICE constexpr
 auto
-softly_compatible(IntTupleA const& a, IntTupleB const& b)
+evenly_divides(Shape const& a, Tiler const& b)
 {
-  if constexpr (is_tuple::value && is_tuple::value) {
-    if constexpr (tuple_size::value != tuple_size::value) {
+  if constexpr (is_tuple::value) {
+    if constexpr (rank_v > rank_v) {
       return false_type{};
     } else {
-      return transform_apply(a, b, [](auto const& x, auto const& y) { return softly_compatible(x,y); },
+      return transform_apply(b, a, [](auto const& x, auto const& y) { return evenly_divides(y,x); },
                                    [](auto const&... z) { return (true_type{} && ... && z); });
     }
-  } else if constexpr (is_integral::value) {
-    return a % size(b) == Int<0>{};
-  } else if constexpr (is_integral::value) {
-    return false_type{};
   } else {
-    return softly_compatible(shape(a), shape(b));
+    return size(a) == size(b) * size(ceil_div(shape(a), b));
   }

   CUTE_GCC_UNREACHABLE;
 }

-template
-using is_softly_compatible = decltype(softly_compatible(declval(), declval()));
-
 /** Replace the elements of Tuple B that are paired with an Int<0> with an Int<1>
 */
 template
@@ -594,7 +559,7 @@ filter_zeros(IntTupleA const& a, IntTupleB const& b)
   if constexpr (is_tuple::value) {
     return transform(a, b, [](auto const& x, auto const& y) { return filter_zeros(x,y); });
   } else if constexpr (is_constant<0, IntTupleA>::value) {
-    return Int<1>{};
+    return repeat_like(b, Int<1>{});
   } else {
     return b;
   }
@@ -899,92 +864,4 @@ elem_geq(T const& t, U const& u) {
   return !elem_less(t, u);
 }

-namespace detail {
-
-/** Increment a (dynamic) coord lexicographically within a shape
- * @pre is_congruent::value
- * \code
- *   auto shape = make_shape(1,2,make_shape(2,3),3);
- *
- *   int i = 0;
- *   for (auto coord = repeat_like(shape, 0); back(coord) != back(shape); increment(coord, shape)) {
- *     std::cout << i++ << ": " << coord << std::endl;
- *   }
- *   assert(i == size(shape));
- * \endcode
- */
-template
-CUTE_HOST_DEVICE constexpr
-void
-increment(Coord& coord, Shape const& shape)
-{
-  if constexpr (is_integral::value) {
-    ++coord;
-  } else {
-    increment(get(coord), get(shape));
-    if constexpr (I+1 < tuple_size::value) {
-      if (back(get(coord)) == back(get(shape))) {
-        back(get(coord)) = 0;
-        increment(coord, shape);
-      }
-    }
-  }
-}
-
-} // end namespace detail
-
-struct ForwardCoordIteratorSentinal
-{};
-
-// A forward iterator for a starting coordinate in a shape's domain, and a shape.
-// The starting coordinate may be zero but need not necessarily be.
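A minimal usage sketch for the `evenly_divides` predicate introduced above; it returns a static `true_type`/`false_type` for fully static arguments and degrades to a runtime `bool` when any extent is dynamic:

```cpp
// Sketch assuming only the cute umbrella header and static integer types.
#include <cute/tensor.hpp>
using namespace cute;

void evenly_divides_example()
{
  // Fully static: result is a compile-time boolean, usable in static_assert.
  static_assert(decltype(evenly_divides(Shape<_128,_64>{}, Shape<_32,_16>{}))::value,
                "each tiler mode divides the matching shape mode");

  // Dynamic extents yield a runtime bool instead.
  bool ok  = evenly_divides(make_shape(128, 64), Shape<_32,_16>{});  // true
  bool bad = evenly_divides(make_shape(130, 64), Shape<_32,_16>{});  // false: 32 does not divide 130
  (void)ok; (void)bad;
}
```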
-template -struct ForwardCoordIterator -{ - static_assert(is_congruent::value); - - CUTE_HOST_DEVICE constexpr - Coord const& operator*() const { return coord; } - - CUTE_HOST_DEVICE constexpr - ForwardCoordIterator& operator++() { detail::increment(coord, shape); return *this; } - - // Sentinel for the end of the implied range - CUTE_HOST_DEVICE constexpr - bool operator< (ForwardCoordIteratorSentinal const&) const { return back(coord) < back(shape); } - CUTE_HOST_DEVICE constexpr - bool operator==(ForwardCoordIteratorSentinal const&) const { return back(coord) == back(shape); } - CUTE_HOST_DEVICE constexpr - bool operator!=(ForwardCoordIteratorSentinal const&) const { return back(coord) != back(shape); } - // NOTE: These are expensive, avoid use - CUTE_HOST_DEVICE constexpr - bool operator< (ForwardCoordIterator const& other) const { return colex_less(coord, other.coord); } - CUTE_HOST_DEVICE constexpr - bool operator==(ForwardCoordIterator const& other) const { return coord == other.coord; } - CUTE_HOST_DEVICE constexpr - bool operator!=(ForwardCoordIterator const& other) const { return coord != other.coord; } - - Coord coord; - Shape const& shape; -}; - -// A forward iterator for a coordinate that starts from a provided coordinate -template -CUTE_HOST_DEVICE constexpr -auto -make_coord_iterator(Coord const& coord, Shape const& shape) -{ - return ForwardCoordIterator{coord,shape}; -} - -// A forward iterator for a coordinate that starts from zero -template -CUTE_HOST_DEVICE constexpr -auto -make_coord_iterator(Shape const& shape) -{ - auto coord = repeat_like(shape, int(0)); - return make_coord_iterator(coord, shape); -} - } // end namespace cute diff --git a/include/cute/layout.hpp b/include/cute/layout.hpp index 60581192b0..bc1b54efbc 100644 --- a/include/cute/layout.hpp +++ b/include/cute/layout.hpp @@ -31,13 +31,13 @@ #pragma once #include - -#include #include #include +#include #include -#include #include +#include +#include // cute::sizeof_bits namespace cute { @@ -660,7 +660,7 @@ template using cosize_t = decltype(cosize(declval())); template -static constexpr int cosize_v = cosize_t::value; +static constexpr auto cosize_v = cosize_t::value; // With crd2idx(coord, shape), makes sense to have crd2idx(coord, Layout) as well template @@ -905,6 +905,15 @@ filter_zeros(Layout const& layout) return make_layout(filter_zeros(layout.stride(), layout.shape()), layout.stride()); } +// Replace the modes in layout that correspond to a 0 at the terminals of trg_profile with a 1-size +template +CUTE_HOST_DEVICE constexpr +auto +filter_zeros(Layout const& layout, IntTuple const& trg_profile) +{ + return make_layout(filter_zeros(trg_profile, layout.shape()), layout.stride()); +} + // Remove all of the 0-strides and 1-sizes // Return 1-shape if empty template @@ -1350,7 +1359,8 @@ max_common_vector(Layout const& a, /* Return a layout that distributes ShapeB over ShapeA. * * @returns Layout result - * @post softly_compatible(@a b, @a result) + * @post evenly_divides(@a b, size(@a result)) + * @post evenly_divides(@a a, @a result) * @post For all i,j in [0,size(@a result)) with i < j, @a result(i) < @a result(j). Surjective and Ordered. 
* @post composition(make_layout(shape(@a a)), @a result) is admissible * \code @@ -1726,8 +1736,8 @@ tile_to_shape(Layout const& block, // Assert proper division if constexpr (is_static::value) { - CUTE_STATIC_ASSERT_V(weakly_compatible(block_shape, target_shape), - "tile_to_shape: block shape does not divide the target shape."); + CUTE_STATIC_ASSERT_V(evenly_divides(target_shape, block_shape), + "tile_to_shape: block shape does not divide the target shape."); } auto product_shape = ceil_div(target_shape, block_shape); @@ -1924,92 +1934,97 @@ print_layout(Layout const& layout, ThrID const& thrid) // (m,n) -> (tid,vid) a printf("+\n"); } -// Generic 2D Layout to Latex printer -- B&W 8-value color coding -template +struct TikzColor_White { + CUTE_HOST_DEVICE char const* + operator()(int idx) const { + return "white"; + } +}; + +struct TikzColor_BWx8 { + CUTE_HOST_DEVICE char const* + operator()(int idx) const { + static char const* color_map[8] = {"black!00", "black!40", "black!20", "black!60", + "black!10", "black!50", "black!30", "black!70"}; + return color_map[idx % 8]; + } +}; + +struct TikzColor_TV { + CUTE_HOST_DEVICE char const* + operator()(int tid, int vid) const { + static char const* color_map[8] = {"{rgb,255:red,175;green,175;blue,255}", + "{rgb,255:red,175;green,255;blue,175}", + "{rgb,255:red,255;green,255;blue,175}", + "{rgb,255:red,255;green,175;blue,175}", + "{rgb,255:red,210;green,210;blue,255}", + "{rgb,255:red,210;green,255;blue,210}", + "{rgb,255:red,255;green,255;blue,210}", + "{rgb,255:red,255;green,210;blue,210}"}; + return color_map[tid % 8]; + } +}; + +// Generic 2D Layout to LaTeX printer +template CUTE_HOST_DEVICE void -print_latex(LayoutA const& layout_a) +print_latex(LayoutA const& layout_a, // (m,n) -> idx + TikzColorFn color = {}) // lambda(idx) -> tikz color string { CUTE_STATIC_ASSERT_V(rank(layout_a) <= Int<2>{}); auto layout = append<2>(layout_a, Layout<_1,_0>{}); - char const* latex_header = - "\\documentclass[convert]{standalone}\n" - "\\usepackage{tikz}\n\n" - "\\begin{document}\n" - "\\begin{tikzpicture}[x={(0cm,-1cm)},y={(1cm,0cm)},box/.style={rectangle,draw=black,thick,minimum size=1cm,anchor=center,font=\\Large}]\n\n"; - char const* latex_footer = - "\\end{tikzpicture}\n" - "\\end{document}\n"; - - char const* color_map[8] = {"black!00", - "black!40", - "black!20", - "black!60", - "black!10", - "black!50", - "black!30", - "black!70"}; - - // Header + // Commented print(layout) printf("%% Layout: "); print(layout); printf("\n"); - - printf(latex_header); + // Header + printf("\\documentclass[convert]{standalone}\n" + "\\usepackage{tikz}\n\n" + "\\begin{document}\n" + "\\begin{tikzpicture}[x={(0cm,-1cm)},y={(1cm,0cm)},every node/.style={minimum size=1cm, outer sep=0pt}]\n\n"); // Layout for (int i = 0; i < size<0>(layout); ++i) { for (int j = 0; j < size<1>(layout); ++j) { int idx = layout(i,j); - printf("\\node[box,fill=%s] at (%d,%d) {%d};\n", - color_map[idx % 8], - i, j, - idx); + printf("\\node[fill=%s] at (%d,%d) {%d};\n", + color(idx), i, j, idx); } } - + // Grid + printf("\\draw[color=black,thick,shift={(-0.5,-0.5)}] (0,0) grid (%d,%d);\n\n", + int(size<0>(layout)), int(size<1>(layout))); // Labels - for (int i = 0, j = -1; i < size<0>(layout); ++i) { + for (int i = 0, j = -1; i < size<0>(layout); ++i) { printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", i, j, i); } - for (int j = 0, i = -1; j < size<1>(layout); ++j) { + for (int i = -1, j = 0; j < size<1>(layout); ++j) { printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", i, j, j); } 
// Footer - printf(latex_footer); + printf("\\end{tikzpicture}\n" + "\\end{document}\n"); } -// Generic ThrVal 2D Layout to Latex TIKZ -- 8-value color coded by thread -template +// Generic ThrVal 2D Layout to LaTeX TikZ +template CUTE_HOST_DEVICE void -print_latex(Layout const& layout, ThrID const& thr) // (m,n) -> (tid,vid) and tid -> thr_idx +print_latex(Layout const& layout, // (m,n) -> (tid,vid) + ThrID const& thr, // tid -> thr_idx + TikzColorFn color = {}) // lambda(thr_idx,val_idx) -> tikz color string { CUTE_STATIC_ASSERT_V(rank(layout) == Int<2>{}); - char const* latex_header = - "\\documentclass[convert]{standalone}\n" - "\\usepackage{tikz}\n\n" - "\\begin{document}\n" - "\\begin{tikzpicture}[x={(0cm,-1cm)},y={(1cm,0cm)},box/.style={rectangle,draw=black,thick,minimum size=1cm,anchor=center}]\n\n"; - char const* latex_footer = - "\\end{tikzpicture}\n" - "\\end{document}\n"; - - char const* color_map[8] = {"{rgb,255:red,175;green,175;blue,255}", - "{rgb,255:red,175;green,255;blue,175}", - "{rgb,255:red,255;green,255;blue,175}", - "{rgb,255:red,255;green,175;blue,175}", - "{rgb,255:red,210;green,210;blue,255}", - "{rgb,255:red,210;green,255;blue,210}", - "{rgb,255:red,255;green,255;blue,210}", - "{rgb,255:red,255;green,210;blue,210}"}; - + // Commented prints + printf("%% Layout: "); print(layout); printf("\n"); + printf("%% ThrID : "); print(thr); printf("\n"); // Header - printf("%% layout: "); print(layout); printf("\n"); - printf("%% thrid: "); print(thr); printf("\n\n"); - - printf(latex_header); + printf("\\documentclass[convert]{standalone}\n" + "\\usepackage{tikz}\n\n" + "\\begin{document}\n" + "\\begin{tikzpicture}[x={(0cm,-1cm)},y={(1cm,0cm)},every node/.style={minimum size=1cm, outer sep=0pt}]\n\n"); // Layout for (int i = 0; i < size<0>(layout); ++i) { @@ -2018,13 +2033,15 @@ print_latex(Layout const& layout, ThrID const& thr) // (m,n) -> (tid,vid) and int val_idx = layout(i,j) / size(thr); int thr_idx = thr(thrid); - printf("\\node[box,fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", - color_map[thr_idx % 8], + printf("\\node[fill=%s] at (%d,%d) {\\shortstack{T%d \\\\ V%d}};\n", + color(thr_idx, val_idx), i, j, thr_idx, val_idx); } } - + // Grid + printf("\\draw[color=black,thick,shift={(-0.5,-0.5)}] (0,0) grid (%d,%d);\n\n", + int(size<0>(layout)), int(size<1>(layout))); // Labels for (int i = 0, j = -1; i < size<0>(layout); ++i) { printf("\\node at (%d,%d) {\\Large{\\texttt{%d}}};\n", i, j, i); @@ -2034,13 +2051,8 @@ print_latex(Layout const& layout, ThrID const& thr) // (m,n) -> (tid,vid) and } // Footer - printf(latex_footer); + printf("\\end{tikzpicture}\n" + "\\end{document}\n"); } } // end namespace cute - -// -// Extended Layouts -// - -#include diff --git a/include/cute/layout_composed.hpp b/include/cute/layout_composed.hpp index fb62541cb4..3e5f836279 100644 --- a/include/cute/layout_composed.hpp +++ b/include/cute/layout_composed.hpp @@ -30,9 +30,9 @@ **************************************************************************************************/ #pragma once -#include - -#include +#include // CUTE_HOST_DEVICE, CUTE_GCC_UNREACHABLE +#include // cute::tuple +#include // cute::true_type, cute::false_type, cute::Int /* This implements a ComposedLayout of the form * LayoutA o Offset o LayoutB diff --git a/include/cute/numeric/arithmetic_tuple.hpp b/include/cute/numeric/arithmetic_tuple.hpp index 651ff8e887..2e46905719 100644 --- a/include/cute/numeric/arithmetic_tuple.hpp +++ b/include/cute/numeric/arithmetic_tuple.hpp @@ -197,7 +197,7 @@ struct 
ArithmeticTupleIterator ArithmeticTupleIterator(ArithTuple const& coord = {}) : coord_(coord) {} CUTE_HOST_DEVICE constexpr - ArithTuple const& operator*() const { return coord_; } + ArithTuple operator*() const { return coord_; } template CUTE_HOST_DEVICE constexpr @@ -206,7 +206,7 @@ struct ArithmeticTupleIterator template CUTE_HOST_DEVICE constexpr auto operator+(Coord const& c) const { - return ArithmeticTupleIterator(coord_ + c); + return ArithmeticTupleIterator>(coord_ + c); } }; @@ -268,13 +268,13 @@ basis_value(SB const& e) // Apply the N... pack to another Tuple template -CUTE_HOST_DEVICE constexpr auto -basis_get(SB const& e, Tuple const& t) +CUTE_HOST_DEVICE decltype(auto) +basis_get(SB const& e, Tuple&& t) { if constexpr (is_scaled_basis::value) { - return basis_get(e.value(), get(t)); + return basis_get(e.value(), get(static_cast(t))); } else { - return t; + return static_cast(t); } CUTE_GCC_UNREACHABLE; } diff --git a/include/cute/numeric/complex.hpp b/include/cute/numeric/complex.hpp index 5aa6664a89..7dd9ea5bf0 100644 --- a/include/cute/numeric/complex.hpp +++ b/include/cute/numeric/complex.hpp @@ -30,9 +30,9 @@ **************************************************************************************************/ #pragma once -#include -#include -#include +#include // CUTE_HOST_DEVICE + +#include // cutlass::complexm, cutlass::real, cutlass::imag, cutlass::is_complex namespace cute { diff --git a/include/cute/numeric/int.hpp b/include/cute/numeric/int.hpp index 169e3a0e67..571b3e3ed0 100644 --- a/include/cute/numeric/int.hpp +++ b/include/cute/numeric/int.hpp @@ -36,7 +36,9 @@ #include #endif -#include +#include // CUTE_STL_NAMESPACE + +#include // cutlass::int2b_t, cutlass::int4b_t namespace cute { @@ -53,8 +55,8 @@ using CUTE_STL_NAMESPACE::int32_t; using CUTE_STL_NAMESPACE::int64_t; template struct int_bit; -template <> struct int_bit< 2> { using type = cutlass::int2b_t; }; -template <> struct int_bit< 4> { using type = cutlass::int4b_t; }; +template <> struct int_bit< 2> { using type = int2_t; }; +template <> struct int_bit< 4> { using type = int4_t; }; template <> struct int_bit< 8> { using type = int8_t; }; template <> struct int_bit< 16> { using type = int16_t; }; template <> struct int_bit< 32> { using type = int32_t; }; @@ -83,9 +85,9 @@ using CUTE_STL_NAMESPACE::uint64_t; using cutlass::uint128_t; template struct uint_bit; -template <> struct uint_bit< 1> { using type = cutlass::uint1b_t; }; -template <> struct uint_bit< 2> { using type = cutlass::uint2b_t; }; -template <> struct uint_bit< 4> { using type = cutlass::uint4b_t; }; +template <> struct uint_bit< 1> { using type = uint1_t; }; +template <> struct uint_bit< 2> { using type = uint2_t; }; +template <> struct uint_bit< 4> { using type = uint4_t; }; template <> struct uint_bit< 8> { using type = uint8_t; }; template <> struct uint_bit< 16> { using type = uint16_t; }; template <> struct uint_bit< 32> { using type = uint32_t; }; diff --git a/include/cute/numeric/integral_constant.hpp b/include/cute/numeric/integral_constant.hpp index 46863ac286..88b00922f7 100644 --- a/include/cute/numeric/integral_constant.hpp +++ b/include/cute/numeric/integral_constant.hpp @@ -30,10 +30,9 @@ **************************************************************************************************/ #pragma once -#include "cute/util/print.hpp" -#include "cute/util/type_traits.hpp" -#include "cute/numeric/math.hpp" -#include "cutlass/fast_math.h" +#include // cute::max, etc +#include // cute::print +#include // __CUTE_REQUIRES, 
cute::is_std_integral namespace cute { @@ -65,7 +64,7 @@ struct integral_constant : C { static constexpr T value = v; using value_type = T; // Disambiguate C::operator value_type() - //CUTE_HOST_DEVICE constexpr operator value_type() const noexcept { return value; } + //CUTE_HOST_DEVICE constexpr operator value_type() const noexcept { return value; } CUTE_HOST_DEVICE constexpr value_type operator()() const noexcept { return value; } }; @@ -147,19 +146,33 @@ using _12 = Int<12>; using _16 = Int<16>; using _24 = Int<24>; using _32 = Int<32>; +using _40 = Int<40>; using _48 = Int<48>; +using _56 = Int<56>; using _64 = Int<64>; +using _72 = Int<72>; using _80 = Int<80>; +using _88 = Int<88>; using _96 = Int<96>; +using _104 = Int<104>; using _112 = Int<112>; +using _120 = Int<120>; using _128 = Int<128>; +using _136 = Int<136>; using _144 = Int<144>; +using _152 = Int<152>; using _160 = Int<160>; +using _168 = Int<168>; using _176 = Int<176>; +using _184 = Int<184>; using _192 = Int<192>; +using _200 = Int<200>; using _208 = Int<208>; +using _216 = Int<216>; using _224 = Int<224>; +using _232 = Int<232>; using _240 = Int<240>; +using _248 = Int<248>; using _256 = Int<256>; using _384 = Int<384>; using _512 = Int<512>; @@ -406,6 +419,20 @@ conditional_return(false_type, TrueType&&, FalseType&& f) { return static_cast(f); } +template +CUTE_HOST_DEVICE constexpr +auto +conditional_return(bool b, C const&, C const&) { + return C{}; +} + +template +CUTE_HOST_DEVICE constexpr +auto +conditional_return(bool b, C const&, C const&) { + return b ? v : u; +} + // TrueType and FalseType must have a common type template CUTE_HOST_DEVICE constexpr @@ -435,7 +462,7 @@ static_value() return Int{}; } else { return Trait::value; - } + } CUTE_GCC_UNREACHABLE; } @@ -480,7 +507,7 @@ constexpr uint64_t parse_int_digits(uint64_t result, int digit, Ts... digits) // var has type cute::constant. // template -constexpr cute::constant operator "" _c() +constexpr cute::constant operator ""_c() { static_assert((('0' <= digits && digits <= '9') && ...), "Expected 0 <= digit <= 9 for each digit of the integer."); diff --git a/include/cute/numeric/integral_ratio.hpp b/include/cute/numeric/integral_ratio.hpp index 943b004982..1b1432533a 100644 --- a/include/cute/numeric/integral_ratio.hpp +++ b/include/cute/numeric/integral_ratio.hpp @@ -30,11 +30,10 @@ **************************************************************************************************/ #pragma once -#include - -#include -#include -#include +#include // CUTE_HOST_DEVICE +#include // cute::false_type, cute::true_type +#include // cute::signum +#include // __CUTE_REQUIRES namespace cute { diff --git a/include/cute/numeric/math.hpp b/include/cute/numeric/math.hpp index 6d95165de2..e493a3a953 100644 --- a/include/cute/numeric/math.hpp +++ b/include/cute/numeric/math.hpp @@ -30,9 +30,9 @@ **************************************************************************************************/ #pragma once -#include +#include // CUTE_HOST_DEVICE +#include // __CUTE_REQUIRES -#include #include namespace cute @@ -143,7 +143,7 @@ has_single_bit(T x) { // bit_width( 0b0111 ) = 3 template CUTE_HOST_DEVICE constexpr -T +int bit_width(T x) { static_assert(is_unsigned::value, "Only to be used for unsigned types."); constexpr int N = (numeric_limits::digits == 64 ? 
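Among the integral_constant.hpp changes above is a whitespace fix to the literal operator (`operator ""_c`). For readers unfamiliar with the mechanism, here is a standalone sketch of how such a literal carries its value in the type; this is not CuTe's parse_int_digits implementation, just the same idea in plain C++17:

```cpp
#include <cstdio>
#include <type_traits>

// Fold the digit characters of the literal into a value at compile time.
template <char... digits>
constexpr unsigned long long parse_digits() {
  unsigned long long r = 0;
  ((r = r * 10 + (digits - '0')), ...);
  return r;
}

// Numeric literal operator template: the digits become template arguments,
// so the value is carried in the *type*, like cute::constant.
template <char... digits>
constexpr std::integral_constant<unsigned long long, parse_digits<digits...>()>
operator ""_c() { return {}; }

int main() {
  auto n = 128_c;
  static_assert(decltype(n)::value == 128, "the value lives in the type");
  std::printf("%llu\n", n());   // integral_constant::operator() returns 128
}
```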
6 : @@ -224,7 +224,7 @@ rotr(T x, int s) { // countl_zero( 0b00011100 ) = 3 template CUTE_HOST_DEVICE constexpr -T +int countl_zero(T x) { return numeric_limits::digits - bit_width(x); } @@ -235,7 +235,7 @@ countl_zero(T x) { // countl_one( 0b11100011 ) = 3 template CUTE_HOST_DEVICE constexpr -T +int countl_one(T x) { return countl_zero(~x); } @@ -246,7 +246,7 @@ countl_one(T x) { // countr_zero( 0b00011100 ) = 2 template CUTE_HOST_DEVICE constexpr -T +int countr_zero(T x) { return x == 0 ? numeric_limits::digits : bit_width(T(x & T(-x))) - 1; // bit_width of the LSB } @@ -257,7 +257,7 @@ countr_zero(T x) { // countr_one( 0b11100011 ) = 2 template CUTE_HOST_DEVICE constexpr -T +int countr_one(T x) { return countr_zero(~x); } @@ -285,7 +285,7 @@ popcount(T x) { // Computes the result of bitwise left-shift template CUTE_HOST_DEVICE constexpr -T +auto shiftl(T x, int s) { return s >= 0 ? (x << s) : (x >> -s); } @@ -293,7 +293,7 @@ shiftl(T x, int s) { // Computes the result of bitwise right-shift template CUTE_HOST_DEVICE constexpr -T +auto shiftr(T x, int s) { return s >= 0 ? (x >> s) : (x << -s); } diff --git a/include/cute/numeric/numeric_types.hpp b/include/cute/numeric/numeric_types.hpp index fc9f1725c5..3b9e114ebe 100644 --- a/include/cute/numeric/numeric_types.hpp +++ b/include/cute/numeric/numeric_types.hpp @@ -30,16 +30,15 @@ **************************************************************************************************/ #pragma once +#include // CUTE_HOST_DEVICE +#include // cute::int2_t, cute::int4_t, etc + #if defined(CUTLASS_ENABLE_SYCL) #include -#else -#include #endif -#include -#include -#include -#include +#include // cutlass::sizeof_bits +#include // cutlass::float_e4m3_t, cutlass::float_e5m2_t, etc namespace cute { @@ -76,4 +75,65 @@ using cutlass::int4b_t; using cutlass::uint4b_t; using cutlass::bin1_t; -} // end namespace cute + +// +// Print utility +// + +CUTE_HOST_DEVICE +void +print(half_t a) { + printf("%f", static_cast(a)); +} + +CUTE_HOST_DEVICE +void +print(bfloat16_t a) { + printf("%f", static_cast(a)); +} + + +CUTE_HOST_DEVICE +void +print(tfloat32_t a) { + printf("%f", static_cast(a)); +} + +CUTE_HOST_DEVICE +void +print(float_e4m3_t a) { + printf("%f", static_cast(a)); +} + +CUTE_HOST_DEVICE +void +print(float_e5m2_t a) { + printf("%f", static_cast(a)); +} + +CUTE_HOST_DEVICE void +pretty_print(bfloat16_t v) { + printf("%*.2f", 8, float(v)); +} + +CUTE_HOST_DEVICE void +pretty_print(half_t v) { + printf("%*.2f", 8, float(v)); +} + +CUTE_HOST_DEVICE void +pretty_print(tfloat32_t v) { + printf("%*.2e", 10, static_cast(v)); +} + +CUTE_HOST_DEVICE void +pretty_print(float_e4m3_t t) { + printf("%*.2f", 8, static_cast(t)); +} + +CUTE_HOST_DEVICE void +pretty_print(float_e5m2_t t) { + printf("%*.2f", 8, static_cast(t)); +} + +} // namespace cute diff --git a/include/cute/numeric/real.hpp b/include/cute/numeric/real.hpp index f797bc13a1..4ce58dfa18 100644 --- a/include/cute/numeric/real.hpp +++ b/include/cute/numeric/real.hpp @@ -35,6 +35,24 @@ namespace cute { +/// Generic add +template +CUTE_HOST_DEVICE constexpr +void +add(C& c, A const& a, B const& b) +{ + c = a + b; +} + +/// Generic multiply +template +CUTE_HOST_DEVICE constexpr +void +mul(C& c, A const& a, B const& b) +{ + c = a * b; +} + /// Generic fused multiply-add template CUTE_HOST_DEVICE constexpr diff --git a/include/cute/pointer.hpp b/include/cute/pointer.hpp index 604477a0d3..4cfa129cce 100644 --- a/include/cute/pointer.hpp +++ b/include/cute/pointer.hpp @@ -30,17 +30,13 @@ 
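The math.hpp hunks above change bit_width, countl_zero, countl_one, countr_zero, and countr_one from returning T to returning int. This matches the signatures of the C++20 `<bit>` header and avoids needlessly narrow arithmetic such as `numeric_limits<T>::digits - bit_width(x)` on small unsigned types. The standard facilities demonstrate the same values:

```cpp
// Why the return type matters: C++20 <bit> also returns int for these queries.
#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
  uint8_t x = 0b0001'1100;
  // A uint8_t return would be needlessly narrow and invites surprises in
  // expressions like digits - bit_width(x); int sidesteps that.
  int w  = std::bit_width(x);    // 5
  int lz = std::countl_zero(x);  // 3 (8-bit operand)
  int rz = std::countr_zero(x);  // 2
  std::printf("%d %d %d\n", w, lz, rz);
}
```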
**************************************************************************************************/ #pragma once -#include +#include // CUTE_HOST_DEVICE +#include // cute::iter_adaptor +#include +#include // cute::subbyte_iterator +#include // cute::true_type, cute::false_type +#include // sizeof_bits -#include -#include // sizeof_bits -#include -#include - -#include - -#include -#include namespace cute { @@ -50,6 +46,9 @@ namespace cute // Subbyte Types: uint2_t, uint4_t, etc // Requires construction of a subbyte_iterator in order to properly // resolve each element in byte-addressed memory. +// Sparse Types: sparse_elem +// A type that holds one physical element meant to represent S number of logical elements. +// Requires construction of a sparse_ptr that emulates access to the S logical elements. // template @@ -57,6 +56,11 @@ CUTE_HOST_DEVICE constexpr auto recast_ptr(void* ptr) { + if constexpr (is_sparse::value) { + constexpr int sparsity = NewT::sparsity; + NewT* p = reinterpret_cast(ptr); + return make_sparse_ptr(p); + } else if constexpr (cute::is_subbyte_v) { return subbyte_iterator(ptr); } else { @@ -70,6 +74,11 @@ CUTE_HOST_DEVICE constexpr auto recast_ptr(void const* ptr) { + if constexpr (is_sparse::value) { + constexpr int sparsity = NewT::sparsity; + NewT const* p = reinterpret_cast(ptr); + return make_sparse_ptr(p); + } else if constexpr (cute::is_subbyte_v) { return subbyte_iterator(ptr); } else { diff --git a/include/cute/pointer_base.hpp b/include/cute/pointer_base.hpp index db5d3dcfc4..90ca0ceb6e 100644 --- a/include/cute/pointer_base.hpp +++ b/include/cute/pointer_base.hpp @@ -30,10 +30,9 @@ **************************************************************************************************/ #pragma once -#include - -#include -#include // sizeof_bits +#include // CUTE_HOST_DEVICE +#include // cute::sizeof_bits +#include // cute::declval, cute::void_t, etc namespace cute { diff --git a/include/cute/pointer_flagged.hpp b/include/cute/pointer_flagged.hpp index 08751eb169..eb8d7e452e 100644 --- a/include/cute/pointer_flagged.hpp +++ b/include/cute/pointer_flagged.hpp @@ -30,15 +30,13 @@ **************************************************************************************************/ #pragma once -#include - -#include // cast_smem_ptr_to_uint - -#include -#include -#include - -#include +#include // CUTE_HOST_DEVICE +#include // cute::ComposedLayout +#include // cute::make_smem_ptr +#include // cute::is_sparse +#include // cute::make_swizzle_ptr +#include // cute::cast_smem_ptr_to_uint +#include // cute::Int namespace cute { @@ -124,6 +122,47 @@ as_position_independent_swizzle_tensor(Tensor&& tensor) CUTE_GCC_UNREACHABLE; } +// A model of a nullptr sparse_ptr> with B == sizeof_bits::value +// That represents an unset pointer. 
This is a placeholder type that is waiting for an smem_ptr +template +struct smem_sparse_ptr_flag_bits : Int<0> {}; + +template +using smem_sparse_ptr_flag = smem_sparse_ptr_flag_bits; + +// A flagged construction method to transform ComposedLayout +// Make a swizzle pointer tensor and check that the intended type size matches +template +CUTE_HOST_DEVICE constexpr +auto +make_tensor(Iterator const& ptr, + ComposedLayout,Layout> const& layout) +{ + static_assert(is_smem::value, "Expected smem."); + static_assert(is_sparse_ptr::value, "Expected sparse iter"); + static_assert(is_sparse>::value, "Expected sparse elem"); + static_assert(S == iter_value_t::sparsity, "Expected sparsity S"); + static_assert(B == sizeof_bits::raw_type>::value, "Expected B-bit pointer type"); + return make_tensor(make_swizzle_ptr(ptr, layout.layout_a()), layout.layout_b()); +} + +// NOTE: To preserve smem_ptr_flag_bits under recast ops +template +CUTE_HOST_DEVICE constexpr +auto +upcast(ComposedLayout,Layout> const& layout) +{ + static_assert(dependent_false, "Not implemented for safety"); +} + +template +CUTE_HOST_DEVICE constexpr +auto +downcast(ComposedLayout,Layout> const& layout) +{ + static_assert(dependent_false, "Not implemented for safety"); +} + // // Display utilities // @@ -151,4 +190,10 @@ CUTE_HOST_DEVICE void print(smem_ptr_flag_bits ptr) printf("smem_ptr[%db](unset)", B); } +template +CUTE_HOST_DEVICE void print(smem_sparse_ptr_flag_bits) +{ + printf("smem_sparse<%d>_ptr[%db](unset)", S, B); +} + } // end namespace cute diff --git a/include/cute/pointer_sparse.hpp b/include/cute/pointer_sparse.hpp new file mode 100644 index 0000000000..ccae458650 --- /dev/null +++ b/include/cute/pointer_sparse.hpp @@ -0,0 +1,172 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ * + **************************************************************************************************/ + +#pragma once + +#include // CUTE_HOST_DEVICE +#include // cute::iter_adaptor +#include // cute::false_type, cute::true_type +#include // cute::ratio + +namespace cute +{ + +// A data type that holds one physical element meant to represent Sparsity number of logical elements +// This class is purposely not compatible with anything -- know what you're doing if you attempt to use it +template +struct sparse_elem +{ + static constexpr int sparsity = Sparsity; + using raw_type = T; + T elem_; + + CUTE_HOST_DEVICE constexpr + explicit sparse_elem(T const& elem = {}) : elem_(elem) {} + + CUTE_HOST_DEVICE constexpr friend bool operator==(sparse_elem const& a, sparse_elem const& b) { return a.elem_ == b.elem_; } + CUTE_HOST_DEVICE constexpr friend bool operator!=(sparse_elem const& a, sparse_elem const& b) { return a.elem_ != b.elem_; } + CUTE_HOST_DEVICE constexpr friend bool operator< (sparse_elem const& a, sparse_elem const& b) { return a.elem_ < b.elem_; } + CUTE_HOST_DEVICE constexpr friend bool operator<=(sparse_elem const& a, sparse_elem const& b) { return a.elem_ <= b.elem_; } + CUTE_HOST_DEVICE constexpr friend bool operator> (sparse_elem const& a, sparse_elem const& b) { return a.elem_ > b.elem_; } + CUTE_HOST_DEVICE constexpr friend bool operator>=(sparse_elem const& a, sparse_elem const& b) { return a.elem_ >= b.elem_; } +}; + +template +struct is_sparse : false_type {}; +template +struct is_sparse : is_sparse {}; +template +struct is_sparse> : true_type {}; +template +static constexpr auto is_sparse_v = is_sparse::value; + +// Overload sizeof_bits for sparse_elem. +// Much like subbyte element types, this is the effective number of bits in a sparse_elem +// rather than actual physical bits that may be used in storing one. Also like subbyte element +// types, modified iterators are required to properly index and access sparse_elems. +// +// Defining sizeof_bits like this makes reasonable expressions like N * sizeof_bits_v meaningful +// even when E is subbyte or sparse. However, this also means that sparse_elem can rather easily be +// confused with subbyte elements and special care should be taken with each. +template +struct sizeof_bits> { + // Simple implementation that conforms to sizeof_bits + //static constexpr auto value = sizeof_bits::value / S; + //static_assert(value != 0, "sizeof_bits=0 detected. Sparsity is larger than width."); + //static_assert((sizeof_bits::value % S) == 0, "Width needs to be a multiple of sparsity.") + + // Interesting experiment that allows any sparsity level to be used by potentially presenting + // an integral_ratio rather than size_t. This is valid in most integer expressions as well. 
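The sizeof_bits specialization above reports the effective width of one logical element, sizeof_bits of T divided by the sparsity S, falling back to an integral_ratio when S does not divide the width. A plain-C++ worked example of the divisible case, with hypothetical names (effective_bits is not a CuTe type):

```cpp
// Worked example of the "effective bits" bookkeeping (plain C++, not CuTe).
// One sparse_elem<S,T> stands for S logical elements stored as one physical T,
// so each logical element effectively occupies sizeof_bits<T>/S bits.
#include <cstdint>
#include <cstdio>

template <int S, class T>
struct effective_bits {
  static_assert((8 * sizeof(T)) % S == 0, "width must be a multiple of sparsity");
  static constexpr int value = 8 * sizeof(T) / S;
};

int main() {
  // 2:4-style sparsity over 16-bit elements: 8 effective bits per logical element
  constexpr int bits = effective_bits<2, std::uint16_t>::value;
  constexpr int n_logical = 1024;
  std::printf("%d logical elems -> %d bytes of storage\n",
              n_logical, n_logical * bits / 8);   // 1024 -> 1024 bytes
}
```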
+ static constexpr auto value = cute::ratio(cute::Int>{}, cute::Int{}); +}; + +// +// sparse_ptr +// + +template +struct is_sparse_ptr : false_type {}; +template +struct is_sparse_ptr> : is_sparse_ptr {}; + +template +struct sparse_ptr : iter_adaptor> +{ + using reference = typename iterator_traits::reference; + using element_type = typename iterator_traits::element_type; + using value_type = typename iterator_traits::value_type; + + // Sanity, for now + static_assert(is_sparse::value, "Enforce sparse value-type"); + static_assert(Sparsity == iter_value_t::sparsity, "Enforce sparsity S"); + static_assert(not is_sparse_ptr::value, "Enforce sparse singleton"); + + template + CUTE_HOST_DEVICE constexpr + sparse_ptr operator+(Index const& i) const { + // Only allow offset by multiples of the sparsity factor, + // else the misalignments become a bug. E.g. (sparse_ptr<8,I>{} + 7) + 7 + // Motivation for subsparse_iterator or generalization of subbyte_iterator? + assert(i % Sparsity == 0); + return {this->get() + i / Sparsity}; + } + + template + CUTE_HOST_DEVICE constexpr + reference operator[](Index const& i) const { + // Allow offset by any value and dereference. + // Not implemented in terms of sparse_ptr::op+() + return *(this->get() + i / Sparsity); + } +}; + +template +struct is_sparse_ptr> : true_type {}; + +template +CUTE_HOST_DEVICE constexpr +auto +make_sparse_ptr(Iter const& iter) { + if constexpr (Sparsity == 1) { + return iter; + } else { + return sparse_ptr{iter}; + } + CUTE_GCC_UNREACHABLE; +} + +template +CUTE_HOST_DEVICE constexpr +auto +recast_ptr(sparse_ptr const& ptr) { + static_assert(not is_sparse::value); + return recast_ptr(ptr.get()); +} + +// +// Display utilities +// + +template +CUTE_HOST_DEVICE void print(sparse_ptr ptr) +{ + printf("sparse<%d>_", S); print(ptr.get()); +} + +#if !defined(__CUDACC_RTC__) +template +CUTE_HOST std::ostream& operator<<(std::ostream& os, sparse_ptr ptr) +{ + return os << "sparse<" << S << ">_" << ptr.get(); +} +#endif + +} // end namespace cute diff --git a/include/cute/pointer_swizzle.hpp b/include/cute/pointer_swizzle.hpp index a83b485c8e..720b9b1246 100644 --- a/include/cute/pointer_swizzle.hpp +++ b/include/cute/pointer_swizzle.hpp @@ -30,13 +30,11 @@ **************************************************************************************************/ #pragma once -#include - -#include // iterator_traits -#include - -#include -#include +#include // CUTE_HOST_DEVICE +#include // cute::iter_adaptor +#include // cute::Swizzle, cute::get_swizzle primary template +#include // cute::iterator_traits +#include // cute::subbyte_iterator /* This implements a swizzle pointer of the form * InvolutionFn o PtrAdd @@ -107,16 +105,14 @@ struct swizzle_ptr : iter_adaptor> } }; -template // Default No-Swizzle -struct get_swizzle { using type = Swizzle<0,4,3>; }; +// +// Helper Function +// template // Found the SwizzleFn struct get_swizzle> { using type = SwizzleFn; }; template // Recurse into anything with a ::iterator struct get_swizzle> : get_swizzle {}; -template -using get_swizzle_t = typename get_swizzle::type; - template CUTE_HOST_DEVICE constexpr swizzle_ptr diff --git a/include/cute/stride.hpp b/include/cute/stride.hpp index 09a02a00e7..f2d31f4e34 100644 --- a/include/cute/stride.hpp +++ b/include/cute/stride.hpp @@ -30,10 +30,16 @@ **************************************************************************************************/ #pragma once -#include -#include -#include -#include +#include // CUTE_HOST_DEVICE +#include // 
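sparse_ptr above only permits operator+ offsets that are multiples of the sparsity factor, while operator[] divides through for any index. A minimal plain-C++ sketch of that logical-to-physical mapping (illustrative names, not CuTe's implementation):

```cpp
// Plain-C++ sketch of the sparse_ptr offset rule.
// A pointer over sparse_elem<S,T> advances one physical T per S logical indices;
// op+ therefore only accepts multiples of S, while op[] divides through.
#include <cassert>
#include <cstdio>

template <int S, class T>
struct mini_sparse_ptr {
  T* p;
  mini_sparse_ptr operator+(int logical) const {
    assert(logical % S == 0);            // misaligned offsets silently drop remainders
    return {p + logical / S};
  }
  T& operator[](int logical) const { return *(p + logical / S); }
};

int main() {
  float storage[4] = {1, 2, 3, 4};
  mini_sparse_ptr<2, float> sp{storage};   // 8 logical elements backed by 4 physical
  std::printf("%g %g\n", sp[0], sp[7]);    // 1 4  (logical 6 and 7 share storage[3])
}
```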
cute::__CUTE_REQUIRES +#include // cute::is_tuple +#include // cute::is_integral +#include // cute::seq +#include // cute::divmod +#include // cute::basis_get +#include // cute::identity +#include // cute::fold +#include // cute::is_congruent namespace cute { @@ -433,7 +439,7 @@ compact_order(Shape const& shape, Order const& order) auto flat_order = flatten_to_tuple(order); // Find the largest static element of order auto max_order = cute::fold(flat_order, Int<0>{}, [](auto v, auto order) { - if constexpr (is_constant::value) { + if constexpr (is_constant::value) { return order; } else { return v; @@ -474,4 +480,119 @@ compact_order(Shape const& shape, GenRowMajor const& major) return compact_major(shape); } +// +// Coordinate iterator +// + +namespace detail { + +template +CUTE_HOST_DEVICE constexpr +void +increment(Coord& coord, Shape const& shape, Order const& order) +{ + ++basis_get(get<0>(order), coord); + cute::for_each(make_range<1, tuple_size::value>{}, [&](auto i){ + if (basis_get(get(order), coord) == basis_get(get(order), shape)) { + basis_get(get(order), coord) = 0; + ++basis_get(get(order), coord); + } + }); +} + +/** Increment a (dynamic) coord colexicographically within a shape + * @pre is_congruent::value + * \code + * auto shape = make_shape(1,2,make_shape(2,3),3); + * auto coord = repeat_like(shape, 0); + * + * for (int i = 0; i < size(shape); ++i) { + * std::cout << i << ": " << coord << std::endl; + * increment(coord, shape); + * } + * \endcode + */ +template +CUTE_HOST_DEVICE constexpr +void +increment(Coord& coord, Shape const& shape) +{ + increment(coord, shape, flatten_to_tuple(make_basis_like(shape))); +} + +} // end namespace detail + +struct ForwardCoordIteratorSentinel +{}; + +// A forward iterator for a starting coordinate in a shape's domain, and a shape. +// The starting coordinate may be zero but need not necessarily be. 
+template +struct ForwardCoordIterator +{ + static_assert(is_congruent::value); + + CUTE_HOST_DEVICE constexpr + Coord const& operator*() const { return coord; } + CUTE_HOST_DEVICE constexpr + ForwardCoordIterator& operator++() { detail::increment(coord, shape, Order{}); return *this; } + // Sentinel for the end of the implied range + CUTE_HOST_DEVICE constexpr + bool operator==(ForwardCoordIteratorSentinel const&) const { return basis_get(back(Order{}), coord) == basis_get(back(Order{}), shape); } + CUTE_HOST_DEVICE constexpr + bool operator!=(ForwardCoordIteratorSentinel const&) const { return basis_get(back(Order{}), coord) != basis_get(back(Order{}), shape); } + // NOTE: These are expensive, avoid use + CUTE_HOST_DEVICE constexpr + bool operator==(ForwardCoordIterator const& other) const { return coord == other.coord; } + CUTE_HOST_DEVICE constexpr + bool operator!=(ForwardCoordIterator const& other) const { return coord != other.coord; } + + Coord coord; + Shape const& shape; +}; + +// A forward iterator for a coordinate that starts from a provided coordinate and increments in a prescribed order +template +CUTE_HOST_DEVICE constexpr +auto +make_coord_iterator(Coord const& coord, Shape const& shape) +{ + static_assert(is_congruent::value); + static_assert(is_congruent::value); + static_assert(is_congruent::value); + auto flat_order = flatten_to_tuple(Order{}); + auto inv_order = transform(make_seq{}, [&](auto i){ return find(flat_order, i); }); + auto basis_order = transform_leaf(inv_order, [&](auto i) { return get(flatten_to_tuple(make_basis_like(shape))); }); + return ForwardCoordIterator{coord,shape}; +} + +// A forward iterator for a coordinate that starts from a provided coordinate and increments colex +template +CUTE_HOST_DEVICE constexpr +auto +make_coord_iterator(Coord const& coord, Shape const& shape) +{ + static_assert(is_congruent::value); + auto basis_order = flatten_to_tuple(make_basis_like(shape)); + return ForwardCoordIterator{coord,shape}; +} + +// A forward iterator for a coordinate that starts from zero and increments in a prescribed order +template +CUTE_HOST_DEVICE constexpr +auto +make_coord_iterator(Shape const& shape) +{ + return make_coord_iterator(repeat_like(shape, int(0)), shape); +} + +// A forward iterator for a coordinate that starts from zero and increments colex +template +CUTE_HOST_DEVICE constexpr +auto +make_coord_iterator(Shape const& shape) +{ + return make_coord_iterator(repeat_like(shape, int(0)), shape); +} + } // end namespace cute diff --git a/include/cute/swizzle.hpp b/include/cute/swizzle.hpp index 9ceb0d32b0..52abf856dd 100644 --- a/include/cute/swizzle.hpp +++ b/include/cute/swizzle.hpp @@ -30,13 +30,11 @@ **************************************************************************************************/ #pragma once -#include - -#include -#include -#include -#include -#include +#include // CUTE_HOST_DEVICE +#include // cute::is_tuple +#include // cute::constant +#include // cute::max, cute::min +#include // cute::transform_apply namespace cute { @@ -488,4 +486,13 @@ CUTE_HOST std::ostream& operator<<(std::ostream& os, MixedBits const& m) } #endif // !defined(__CUDACC_RTC__) +// +// Helper Function +// +template // Default No-Swizzle +struct get_swizzle { using type = Swizzle<0,4,3>; }; + +template +using get_swizzle_t = typename get_swizzle::type; + } // end namespace cute diff --git a/include/cute/swizzle_layout.hpp b/include/cute/swizzle_layout.hpp index 82e51c79c6..1324360eba 100644 --- a/include/cute/swizzle_layout.hpp +++ 
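The coordinate iterator relocated into stride.hpp now supports an increment Order in addition to the default colexicographic traversal. A standalone sketch of the underlying odometer increment for a flat shape, mirroring what detail::increment does mode by mode:

```cpp
// Standalone sketch of the colexicographic "odometer" increment performed by
// detail::increment / ForwardCoordIterator (flat shapes only, plain C++).
#include <array>
#include <cstddef>
#include <cstdio>

template <std::size_t N>
void colex_increment(std::array<int, N>& coord, std::array<int, N> const& shape) {
  ++coord[0];                                  // always bump the first mode
  for (std::size_t i = 1; i < N; ++i) {        // carry into later modes on overflow
    if (coord[i - 1] == shape[i - 1]) { coord[i - 1] = 0; ++coord[i]; }
  }
  // Note: coord[N-1] is allowed to reach shape[N-1]; that state is the
  // one-past-the-end sentinel the iterator's operator== checks against.
}

int main() {
  std::array<int, 3> shape{2, 2, 3}, coord{0, 0, 0};
  for (int i = 0; i < 2 * 2 * 3; ++i) {
    std::printf("%d: (%d,%d,%d)\n", i, coord[0], coord[1], coord[2]);
    colex_increment(coord, shape);
  }
}
```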
b/include/cute/swizzle_layout.hpp @@ -30,13 +30,10 @@ **************************************************************************************************/ #pragma once -#include - -#include -#include - -#include -#include // get_swizzle +#include // CUTE_HOST_DEVICE +#include // cute::Layout +#include // cute::ComposedLayout +#include // cute::Swizzle, cute::get_swizzle primary template /* Specialized functionality for a ComposedLayout of the form * InvolutionFn o Offset o LayoutB @@ -57,6 +54,9 @@ namespace cute { +// +// Helper Function +// template struct get_swizzle,Offset,LayoutB>> { using type = Swizzle; }; @@ -193,7 +193,7 @@ make_swizzle_strides(true_type, // 0 Z DC // 1 -Z DC - return cute::make_tuple(conditional_return((offset & (Y << Int{})) == Int<0>{}, Z << Int{}, -(Z << Int{}))...); + return cute::make_tuple(conditional_return((offset & (Y << Int{})) == Int<0>{}, Z * Int<(1 << I)>{}, -Z * Int<(1 << I)>{})...); } template @@ -214,7 +214,7 @@ make_swizzle_strides(false_type, // 0 Y+Z Y-Z // 1 DC DC - return cute::make_tuple(conditional_return((offset & (Z << Int{})) == Int<0>{}, (Y+Z) << Int{}, (Y-Z) << Int{})...); + return cute::make_tuple(conditional_return((offset & (Z << Int{})) == Int<0>{}, (Y+Z) * Int<(1 << I)>{}, (Y-Z) * Int<(1 << I)>{})...); } } // end namespace detail @@ -240,16 +240,6 @@ slice_and_offset(Coord const& coord, ComposedLayout,Offset,Layout // The portion of the layout that is not yet consumed auto sliced_layout = slice(coord, layout.layout_b()); - // If the sliced_layout hits two bits that are swizzled together, then don't attempt to decay - - // Compose with the layout to get the swizzle projection, P o L [The Z and Y contributing portions of L] - // (this also tests that shape/stride of layout compose with swizzle) - auto sliced_layout_only_zy = composition(swizzle_only_zy, sliced_layout); - // Transform the end coordinate to get the active bits of the swizzle, (P o L)(c*) - auto swizzle_active_bits = sliced_layout_only_zy(size(sliced_layout_only_zy)-Int<1>{}); - // Determine if any active bits collide under the swizzle - auto hit_ZandY = !(swizzle_active_bits & ~layout.layout_a()(swizzle_active_bits)); - // The portion of the layout that we are consuming now auto diced_layout = dice(coord, layout.layout_b()); auto diced_coord = dice(coord, coord); @@ -269,8 +259,16 @@ slice_and_offset(Coord const& coord, ComposedLayout,Offset,Layout // If Layout's codomain hits on Y XOR Z, then it's dynamic-normal // If Layout's codomain hits on neither Y NOR Z, then it's static-normal - // Test the sliced layout for hit_X & hit_Y for potential decay - if constexpr (is_constant::value) + // If the sliced_layout hits two bits that are swizzled together, then don't attempt to decay + + // Compose with the layout to get the swizzle projection, P o L [The Z and Y contributing portions of L] + // (this also tests that shape/stride of layout compose with swizzle) + auto sliced_layout_only_zy = composition(swizzle_only_zy, sliced_layout); + // Transform the end coordinate to get the active bits of the swizzle, (P o L)(c*) + [[maybe_unused]] auto swizzle_active_bits = sliced_layout_only_zy(size(sliced_layout_only_zy)-Int<1>{}); + + // Determine if any active bits collide under the swizzle for potential decay + if constexpr (is_constant<0, decltype(not (swizzle_active_bits & ~swizzle(swizzle_active_bits)))>::value) { // Hits on Y AND Z, so it's not reducible return cute::make_tuple(composition(swizzle, offset_only_zy, sliced_layout), offset_anti_zy); } else @@ -459,7 +457,7 @@ 
CUTE_HOST_DEVICE constexpr auto max_alignment(Swizzle const&) { - return Int{}; + return Int<1 << M>{}; } template diff --git a/include/cute/tensor.hpp b/include/cute/tensor.hpp index a45cbd0132..3f3335b63d 100644 --- a/include/cute/tensor.hpp +++ b/include/cute/tensor.hpp @@ -37,7 +37,10 @@ // #include +#include #include +#include + // // Tensor Algorithms // diff --git a/include/cute/tensor_impl.hpp b/include/cute/tensor_impl.hpp index da0e245636..61eefc5060 100644 --- a/include/cute/tensor_impl.hpp +++ b/include/cute/tensor_impl.hpp @@ -41,18 +41,16 @@ #pragma once -#include - -#include -#include -#include - -#include -#include -#include - -#include -#include +#include // CUTE_HOST_DEVICE +#include // cute::Shape +#include // cute::is_composed_layout +#include // cute::recast_ptr +#include // cute::iterator_traits +#include // cute::array_aligned +#include // cute::array_subbyte +#include // cute::tuple +#include // cute::is_integral +#include // __CUTE_REQUIRES namespace cute { @@ -69,7 +67,7 @@ namespace cute // iterator begin(); // }; -template +template struct ArrayEngine { using Storage = typename conditional<(sizeof_bits::value % 8 == 0), @@ -85,6 +83,24 @@ struct ArrayEngine CUTE_HOST_DEVICE constexpr auto begin() { return storage_.begin(); } }; +// Specialization for sparse_elem tensor allocation/iteration +template +struct ArrayEngine, N> +{ + static_assert(N % S == 0, "Expected a multiple of the sparsity."); + using value_type = sparse_elem; + using Storage = typename conditional<(sizeof_bits::value % 8 == 0), + array_aligned, + array_subbyte>::type; + using iterator = sparse_ptr*>; + using reference = typename iterator_traits::reference; + using element_type = typename iterator_traits::element_type; + Storage storage_; + + CUTE_HOST_DEVICE constexpr auto begin() const { return recast_ptr(storage_.begin()); } + CUTE_HOST_DEVICE constexpr auto begin() { return recast_ptr(storage_.begin()); } +}; + template struct ViewEngine { @@ -622,6 +638,30 @@ filter_zeros(Tensor&& tensor) { return make_tensor(tensor.data(), filter_zeros(tensor.layout())); } +template +CUTE_HOST_DEVICE constexpr +auto +filter_zeros(Tensor const& tensor, Profile const& profile) +{ + return make_tensor(tensor.data(), filter_zeros(tensor.layout(), profile)); +} + +template +CUTE_HOST_DEVICE constexpr +auto +filter_zeros(Tensor& tensor, Profile const& profile) +{ + return make_tensor(tensor.data(), filter_zeros(tensor.layout(), profile)); +} + +template +CUTE_HOST_DEVICE constexpr +auto +filter_zeros(Tensor&& tensor, Profile const& profile) +{ + return make_tensor(tensor.data(), filter_zeros(tensor.layout(), profile)); +} + // Remove all of the 0-strides and 1-sizes template CUTE_HOST_DEVICE constexpr @@ -755,10 +795,10 @@ auto max_common_vector(Tensor const& a, Tensor const& b) { - using SrcType = typename Tensor::value_type; - using DstType = typename Tensor::value_type; - using SrcRef = typename Tensor::reference; - using DstRef = typename Tensor::reference; + using SrcType = typename SrcEngine::value_type; + using SrcRef = typename SrcEngine::reference; + using DstType = typename DstEngine::value_type; + using DstRef = typename DstEngine::reference; // Determine if vectorization candidates at all if constexpr (// Should be the same value_types, else the copy is also performing a cast @@ -795,10 +835,10 @@ auto max_common_layout(Tensor const& a, Tensor const& b) { - using SrcType = typename Tensor::value_type; - using DstType = typename Tensor::value_type; - using SrcRef = typename Tensor::reference; - using 
DstRef = typename Tensor::reference; + using SrcType = typename SrcEngine::value_type; + using SrcRef = typename SrcEngine::reference; + using DstType = typename DstEngine::value_type; + using DstRef = typename DstEngine::reference; // Determine if vectorization candidates at all if constexpr (// Should be the same value_types, else the copy is also performing a cast diff --git a/include/cute/tensor_predicate.hpp b/include/cute/tensor_predicate.hpp index 6814647071..9c8a2ba614 100644 --- a/include/cute/tensor_predicate.hpp +++ b/include/cute/tensor_predicate.hpp @@ -30,9 +30,8 @@ **************************************************************************************************/ #pragma once -#include - -#include +#include // CUTE_HOST_DEVICE +#include // cute::true_type namespace cute { diff --git a/include/cute/tensor_zip.hpp b/include/cute/tensor_zip.hpp new file mode 100644 index 0000000000..6d70ffc847 --- /dev/null +++ b/include/cute/tensor_zip.hpp @@ -0,0 +1,243 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +#pragma once + +#include // CUTE_HOST_DEVICE +#include // cute::Tensor +#include // cute::tuple + +namespace cute +{ + +// A tuple of Iterators that can be offset asymmetrically +// Note that this only accepts op+(tuple) and op[tuple] +// where each iterator will be offset by its respective index only. +// READ-ONLY for now until cute::tuple can be constructed with references. +template +struct ZipIterator +{ + using value_type = cute::tuple...>; + using element_type = cute::tuple...>; + // NOTE: cute::tuple does not support constructions with references at the moment. + // Consider fixes and/or an implementation of std::forward_as_tuple. 
+ // For now, use a cute::tuple of value_types instead, which makes this Iterator READ-ONLY. + //using reference = cute::tuple...>; + using reference = value_type; + + ZipIterator() = delete; + + CUTE_HOST_DEVICE constexpr + ZipIterator(Iters... iters) + : iters_(iters...) + {} + + CUTE_HOST_DEVICE constexpr + ZipIterator(cute::tuple const& iters) + : iters_(iters) + {} + + CUTE_HOST_DEVICE constexpr + reference operator*() const { + return cute::apply(iters_, [](auto&&... args) { return reference(*args...); }); + } + + template + CUTE_HOST_DEVICE constexpr + ZipIterator operator+(cute::tuple const& idxs) const { + static_assert(sizeof...(Index) == sizeof...(Iters), "Expect same number of offsets as iterators."); + return cute::transform(iters_, idxs, [](auto&& iter, auto&& idx) { return iter + idx; }); + } + + template + CUTE_HOST_DEVICE constexpr + reference operator[](cute::tuple const& idxs) const { + return *(*this + idxs); + } + + cute::tuple iters_; +}; + +//------------------------------------------------------------------------------ +// type traits + +template +struct is_rmem> : conjunction...> {}; +template +struct is_smem> : conjunction...> {}; +template +struct is_gmem> : conjunction...> {}; +// A tuple of Layouts that operates on each Layout symmetrically +// The Layouts need to have compatible shapes and ranks. +// The ZipLayout presents the intersection of the domain of its component Layouts. +// E.g. all Layouts accept 1D coords and ZipLayout does as well. +// The ZipLayout returns the union of the codomain of its component Layouts. +// E.g. all Layouts return an integer so ZipLayout returns a tuple of integers. +template +struct ZipLayout +{ + static constexpr int rank = (int(0) | ... | Layouts::rank); + + static_assert((is_layout::value && ...), "All template parameters must be layouts"); + static_assert(((Layouts::rank == rank) && ...), "All layouts must have the same rank"); + + CUTE_HOST_DEVICE constexpr + ZipLayout(Layouts const&... layouts) + : layouts_(layouts...) + {} + + CUTE_HOST_DEVICE constexpr + ZipLayout(cute::tuple const& layouts) + : layouts_(layouts) + {} + + template + CUTE_HOST_DEVICE constexpr + auto + operator()(Coord const& coord) const { + if constexpr (has_underscore::value) { + return ZipLayout(cute::transform(layouts_, [&] (auto layout) { return layout(coord); })); + } else { + return cute::transform(layouts_, [&] (auto layout) { return layout(coord); }); + } + + CUTE_GCC_UNREACHABLE; + } + + // op() convenience function for multi-dimensional coordinates + template + CUTE_HOST_DEVICE constexpr + decltype(auto) + operator()(Coord0 const& c0, Coord1 const& c1, Coords const&... cs) const { + return operator()(make_coord(c0,c1,cs...)); + } + + cute::tuple layouts_; +}; + +template +struct is_layout> : true_type {}; + +// +// make_zip_tensor and unzip_tensor +// + +template +CUTE_HOST_DEVICE constexpr +auto +make_zip_tensor(Tensor const&... 
tensors) +{ + return make_tensor(ZipIterator(tensors.data()...), + ZipLayout(tensors.layout()...)); +} + +template +CUTE_HOST_DEVICE constexpr +auto +unzip_tensor(Tensor const& tensor) +{ + return cute::transform(tensor.data().iters_, tensor.layout().layouts_, + [](auto iter, auto layout) { return make_tensor(iter, layout); }); +} + +// +// Utilities +// + +template +CUTE_HOST_DEVICE constexpr +auto +rank(ZipLayout const& layouts) +{ + return rank(get<0>(layouts.layouts_)); +} + +template +CUTE_HOST_DEVICE constexpr +auto +size(ZipLayout const& layouts) +{ + return size(get<0>(layouts.layouts_)); +} + +// +// Manipulation +// + +// Extend each component layout to rank-N by appending Layout @a x. +template +CUTE_HOST_DEVICE constexpr +auto +append(ZipLayout const& layouts, + Layout const& x = {}) +{ + return ZipLayout(cute::transform(layouts.layouts_, [&](auto t){ return append(t, x); })); +} + +// Extend each component layout to rank-N by prepending Layout @a x. +template +CUTE_HOST_DEVICE constexpr +auto +prepend(ZipLayout const& layouts, + Layout const& x = {}) +{ + return ZipLayout(cute::transform(layouts.layouts_, [&](auto t){ return prepend(t, x); })); +} + +template +CUTE_HOST_DEVICE constexpr +auto +logical_divide(ZipLayout const& layouts, + Tiler const& tiler) +{ + return ZipLayout(cute::transform(layouts.layouts_, [&](auto t){ return logical_divide(t, tiler); })); +} + +template +CUTE_HOST_DEVICE constexpr +auto +zipped_divide(ZipLayout const& layouts, + Tiler const& tiler) +{ + return ZipLayout(cute::transform(layouts.layouts_, [&](auto t){ return zipped_divide(t, tiler); })); +} + +// Return by calling slice_and_offset and all component layouts. +template +CUTE_HOST_DEVICE constexpr +auto +slice_and_offset(Coord const& c, ZipLayout const& layouts) +{ + auto result = cute::zip(cute::transform(layouts.layouts_, [&c](auto const& layout) { return slice_and_offset(c, layout); })); + return cute::make_tuple(ZipLayout(get<0>(result)), get<1>(result)); +} + +} // end namespace cute diff --git a/include/cute/underscore.hpp b/include/cute/underscore.hpp index 212f42d7fa..e9d80fe5b5 100644 --- a/include/cute/underscore.hpp +++ b/include/cute/underscore.hpp @@ -30,12 +30,9 @@ **************************************************************************************************/ #pragma once -#include - -#include -#include -#include -#include +#include // CUTE_INLINE_CONSTANT, CUTE_HOST_DEVICE +#include // cute::is_tuple +#include // cute::false_type, cute::true_type namespace cute { diff --git a/include/cute/util/print.hpp b/include/cute/util/print.hpp index f1662d07b5..a644290f9f 100644 --- a/include/cute/util/print.hpp +++ b/include/cute/util/print.hpp @@ -30,9 +30,13 @@ **************************************************************************************************/ #pragma once -#include +#include // CUTE_HOST_DEVICE +#include // cute::is_valid +#include -#include +#if defined(CUTLASS_ENABLE_SYCL) +#define printf sycl::ext::oneapi::experimental::printf +#endif #if defined(CUTLASS_ENABLE_SYCL) #define printf sycl::ext::oneapi::experimental::printf @@ -101,6 +105,42 @@ print(int a) { printf("%d", a); } +CUTE_HOST_DEVICE +void +print(uint1b_t a) { + printf("%d", int(a)); +} + +CUTE_HOST_DEVICE +void +print(int2b_t a) { + printf("%d", int(a)); +} + +CUTE_HOST_DEVICE +void +print(uint2b_t a) { + printf("%d", int(a)); +} + +CUTE_HOST_DEVICE +void +print(int4b_t a) { + printf("%d", int(a)); +} + +CUTE_HOST_DEVICE +void +print(uint4b_t a) { + printf("%d", int(a)); +} + +CUTE_HOST_DEVICE +void 
+print(bin1_t a) { + printf("%d", int(a)); +} + CUTE_HOST_DEVICE void print(unsigned int a) { @@ -160,50 +200,70 @@ print(char const* format) { // pretty printing // -template CUTE_HOST_DEVICE void -pretty_print(T const& v) { - printf(" "); print(v); +pretty_print(uint1b_t a) { + printf("%*d", 3, int(a)); +} + +CUTE_HOST_DEVICE void +pretty_print(int2b_t a) { + printf("%*d", 5, int(a)); +} + +CUTE_HOST_DEVICE void +pretty_print(uint2b_t a) { + printf("%*d", 5, int(a)); } CUTE_HOST_DEVICE void -pretty_print(bool const& v) { +pretty_print(int4b_t a) { + printf("%*d", 5, int(a)); +} + +CUTE_HOST_DEVICE void +pretty_print(uint4b_t a) { + printf("%*d", 5, int(a)); +} + +CUTE_HOST_DEVICE void +pretty_print(bool v) { printf("%*d", 3, int(v)); } CUTE_HOST_DEVICE void -pretty_print(int32_t const& v) { +pretty_print(int32_t v) { printf("%*d", 5, v); } CUTE_HOST_DEVICE void -pretty_print(uint32_t const& v) { +pretty_print(uint32_t v) { printf("%*d", 5, v); } CUTE_HOST_DEVICE void -pretty_print(int64_t const& v) { +pretty_print(int64_t v) { printf("%*lld", 5, static_cast(v)); } CUTE_HOST_DEVICE void -pretty_print(uint64_t const& v) { +pretty_print(uint64_t v) { printf("%*llu", 5, static_cast(v)); } CUTE_HOST_DEVICE void -pretty_print(half_t const& v) { - printf("%*.2f", 8, float(v)); +pretty_print(float v) { + printf("%*.2e", 10, v); } CUTE_HOST_DEVICE void -pretty_print(float const& v) { - printf("%*.2e", 10, v); +pretty_print(double v) { + printf("%*.3e", 11, v); } +template CUTE_HOST_DEVICE void -pretty_print(double const& v) { - printf("%*.3e", 11, v); +pretty_print(T t) { + printf(" "); print(t); } } // end namespace cute diff --git a/include/cute/util/type_traits.hpp b/include/cute/util/type_traits.hpp index f12cdb594f..e663b569c6 100644 --- a/include/cute/util/type_traits.hpp +++ b/include/cute/util/type_traits.hpp @@ -44,7 +44,7 @@ #include // numeric_limits #endif -#include +#include // CUTE_STL_NAMESPACE namespace cute { @@ -79,6 +79,7 @@ using CUTE_STL_NAMESPACE::is_const_v; using CUTE_STL_NAMESPACE::is_volatile; using CUTE_STL_NAMESPACE::is_volatile_v; +// Defined in cute/numeric/integral_constant.hpp // using CUTE_STL_NAMESPACE::true_type; // using CUTE_STL_NAMESPACE::false_type; @@ -274,4 +275,18 @@ struct conditional_template { using type = False; }; +// +// is_any_of +// + +// Member `value` is true if and only if T is same as (is_same_v) at least one of the types in Us +template +struct is_any_of { + constexpr static bool value = (... || CUTE_STL_NAMESPACE::is_same_v); +}; + +// Is true if and only if T is same as (is_same_v) at least one of the types in Us +template +inline constexpr bool is_any_of_v = is_any_of::value; + } // end namespace cute diff --git a/include/cutlass/arch/barrier.h b/include/cutlass/arch/barrier.h index 0e1f344f27..11f68aa60d 100644 --- a/include/cutlass/arch/barrier.h +++ b/include/cutlass/arch/barrier.h @@ -102,12 +102,24 @@ class NamedBarrier { NamedBarrier::arrive_and_wait_internal(num_threads_, id_); } + CUTLASS_DEVICE + void arrive_and_wait_unaligned() const { + // Note: The value of id_ is already the final barrier id (set correctly in the constructor). + NamedBarrier::arrive_and_wait_internal_unaligned(num_threads_, id_); + } + CUTLASS_DEVICE void arrive() const { // Note: The value of id_ is already the final barrier id (set correctly in the constructor). NamedBarrier::arrive_internal(num_threads_, id_); } + CUTLASS_DEVICE + void arrive_unaligned() const { + // Note: The value of id_ is already the final barrier id (set correctly in the constructor). 
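The new is_any_of trait introduced in type_traits.hpp above is a straightforward disjunction fold; a plain-C++ equivalent for reference:

```cpp
// Plain-C++ equivalent of the new is_any_of trait, as a one-line fold.
#include <type_traits>

template <class T, class... Us>
inline constexpr bool is_any_of_v = (std::is_same_v<T, Us> || ...);

static_assert( is_any_of_v<int,   short, int, long>, "int appears in the list");
static_assert(!is_any_of_v<float, short, int, long>, "float does not");

int main() {}
```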
+ NamedBarrier::arrive_internal_unaligned(num_threads_, id_); + } + CUTLASS_DEVICE void sync() const { NamedBarrier::arrive_and_wait(); @@ -157,6 +169,7 @@ class NamedBarrier { sync_internal(num_threads, static_cast(reserved_named_barriers)); } + private: CUTLASS_DEVICE static void arrive_and_wait_internal(uint32_t num_threads, uint32_t barrier_id) { @@ -165,6 +178,17 @@ class NamedBarrier { __spirv_ControlBarrierWaitINTEL(EXECUTION_SCOPE_WORK_GROUP, MEMORY_SCOPE_WORK_GROUP, MEMORY_SEMANTICS_RELAXED); #elif CUDA_BARRIER_ENABLED asm volatile("bar.sync %0, %1;" : : "r"(barrier_id), "r"(num_threads)); + cutlass::arch::synclog_emit_named_barrier_arrive_and_wait(__LINE__, num_threads, barrier_id); +#elif defined(__CUDA_ARCH__) + asm volatile ("brkpt;\n" ::); +#endif + } + + CUTLASS_DEVICE + static void arrive_and_wait_internal_unaligned(uint32_t num_threads, uint32_t barrier_id) { +#if CUDA_BARRIER_ENABLED + asm volatile("barrier.sync %0, %1;" : : "r"(barrier_id), "r"(num_threads)); + cutlass::arch::synclog_emit_named_barrier_arrive_and_wait(__LINE__, num_threads, barrier_id); #elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif @@ -175,12 +199,23 @@ class NamedBarrier { #if defined(SYCL_INTEL_TARGET) __spirv_ControlBarrierArriveINTEL(EXECUTION_SCOPE_WORK_GROUP, MEMORY_SCOPE_WORK_GROUP, MEMORY_SEMANTICS_RELAXED); #elif CUDA_BARRIER_ENABLED + cutlass::arch::synclog_emit_named_barrier_arrive(__LINE__, num_threads, barrier_id); asm volatile("bar.arrive %0, %1;" : : "r"(barrier_id), "r"(num_threads)); #elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif } + CUTLASS_DEVICE + static void arrive_internal_unaligned(uint32_t num_threads, uint32_t barrier_id) { +#if CUDA_BARRIER_ENABLED + cutlass::arch::synclog_emit_named_barrier_arrive(__LINE__, num_threads, barrier_id); + asm volatile("barrier.arrive %0, %1;" : : "r"(barrier_id), "r"(num_threads)); +#elif defined(__CUDA_ARCH__) + asm volatile ("brkpt;\n" ::); +#endif + } + CUTLASS_DEVICE static void sync_internal(uint32_t num_threads, uint32_t barrier_id) { NamedBarrier::arrive_and_wait_internal(num_threads, barrier_id); @@ -257,6 +292,7 @@ struct ClusterBarrier { "}" : : "r"(arrive_count), "r"(smem_addr)); + cutlass::arch::synclog_emit_cluster_barrier_init(__LINE__, smem_addr, arrive_count); #elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif @@ -267,6 +303,7 @@ struct ClusterBarrier { static void wait(ValueType const* smem_ptr, uint32_t phase) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_cluster_barrier_wait(__LINE__, smem_addr, phase); // Arbitrarily large timer value after which try-wait expires and re-tries. 
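The *_unaligned NamedBarrier entry points above switch from PTX bar.sync, which requires every thread of each participating warp to execute the barrier, to barrier.sync, which permits warp-divergent arrival. A hedged CUDA sketch of where that matters; the two-argument NamedBarrier constructor is assumed from the surrounding class and is not shown in the hunk:

```cpp
// Hedged sketch (CUDA): why the unaligned variants exist. bar.sync is the
// "aligned" barrier: all threads of every participating warp must execute it.
// barrier.sync drops that requirement, so warp-divergent arrival is legal.
// NamedBarrier(num_threads, id) is assumed from context here.
#include "cutlass/arch/barrier.h"

__global__ void partial_warp_barrier() {
  cutlass::arch::NamedBarrier bar(/*num_threads=*/64, /*id=*/1);
  // Threads 16..79 participate: warps 0 and 2 are only partially active,
  // so the aligned bar.sync form would be illegal here.
  if (threadIdx.x >= 16 && threadIdx.x < 80) {
    bar.arrive_and_wait_unaligned();
  }
}
```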
uint32_t ticks = 0x989680; asm volatile( @@ -290,6 +327,7 @@ struct ClusterBarrier { static bool test_wait(ValueType const* smem_ptr, uint32_t phase, uint32_t pred) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_cluster_barrier_test_wait(__LINE__, smem_addr, phase, pred); uint32_t waitComplete; asm volatile( @@ -314,6 +352,7 @@ struct ClusterBarrier { static bool try_wait(ValueType const* smem_ptr, uint32_t phase) { #if CUDA_BARRIER_ENABLED uint32_t smem_addr = cute::cast_smem_ptr_to_uint(smem_ptr); + cutlass::arch::synclog_emit_cluster_barrier_try_wait(__LINE__, smem_addr, phase); uint32_t waitComplete; asm volatile( @@ -348,6 +387,7 @@ struct ClusterBarrier { : "r"(smem_addr), "r"(cta_id)); } + cutlass::arch::synclog_emit_cluster_barrier_arrive_cluster(__LINE__, smem_addr, cta_id, pred); #elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif @@ -364,6 +404,7 @@ struct ClusterBarrier { "}" : : "r"(smem_addr)); + cutlass::arch::synclog_emit_cluster_barrier_arrive(__LINE__, smem_addr); #elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif @@ -440,6 +481,7 @@ struct ClusterTransactionBarrier : public ClusterBarrier { "}" : : "r"(transaction_bytes), "r"(smem_addr)); + cutlass::arch::synclog_emit_cluster_transaction_barrier_arrive_and_expect_tx(__LINE__, smem_addr, transaction_bytes); #elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif @@ -477,6 +519,7 @@ struct ClusterTransactionBarrier : public ClusterBarrier { "}" : : "r"(transaction_bytes), "r"(smem_addr)); + cutlass::arch::synclog_emit_cluster_transaction_barrier_expect_transaction(__LINE__, smem_addr, transaction_bytes); #elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif @@ -497,6 +540,7 @@ struct ClusterTransactionBarrier : public ClusterBarrier { "}" : : "r"(transaction_bytes), "r"(smem_addr), "r"(pred)); + cutlass::arch::synclog_emit_cluster_transaction_barrier_complete_transaction(__LINE__, smem_addr, dst_cta_id, transaction_bytes, pred); #elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif @@ -550,6 +594,7 @@ struct ClusterTransactionBarrier : public ClusterBarrier { CUTLASS_DEVICE void fence_barrier_init() { #if CUDA_BARRIER_ENABLED + cutlass::arch::synclog_emit_fence_barrier_init(__LINE__); asm volatile( "{\n\t" "fence.mbarrier_init.release.cluster; \n" @@ -564,6 +609,7 @@ void fence_barrier_init() { CUTLASS_DEVICE void fence_view_async_shared() { #if CUDA_BARRIER_ENABLED + cutlass::arch::synclog_emit_fence_view_async_shared(__LINE__); asm volatile ( "{\n\t" "fence.proxy.async.shared::cta; \n" @@ -585,6 +631,7 @@ void cpasync_barrier_arrive(uint64_t const* smem_ptr) { "}" : : "r"(smem_addr)); + cutlass::arch::synclog_emit_cpasync_barrier_arrive(__LINE__, smem_addr); #elif defined(__CUDA_ARCH__) asm volatile ("brkpt;\n" ::); #endif diff --git a/include/cutlass/arch/config.h b/include/cutlass/arch/config.h new file mode 100644 index 0000000000..b0f750063c --- /dev/null +++ b/include/cutlass/arch/config.h @@ -0,0 +1,81 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. 
Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief Definitions for architecture macros +*/ + +#pragma once + +///////////////////////////////////////////////////////////////////////////////////////////////// + +// SM90 +#if (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 0)) + #define CUTLASS_ARCH_MMA_SM90_SUPPORTED 1 + #if (!defined(CUTLASS_ARCH_MMA_SM90_ENABLED) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 900) + #define CUTLASS_ARCH_MMA_SM90_ENABLED 1 + + #if (!defined(CUTLASS_ARCH_MMA_SM90A_ENABLED) && defined(__CUDA_ARCH_FEAT_SM90_ALL)) + #define CUTLASS_ARCH_MMA_SM90A_ENABLED 1 + #endif + #endif +#endif + +#if (__CUDACC_VER_MAJOR__ >= 12 && __CUDACC_VER_MINOR__ >= 2) + #define CUTLASS_ARCH_MMA_SPARSE_SM90_SUPPORTED +#endif + +///////////////////////////////////////////////////////////////////////////////////////////////// + +// SM90 Modifiable +#if (__CUDACC_VER_MAJOR__ > 12 || (__CUDACC_VER_MAJOR__ == 12 && __CUDACC_VER_MINOR__ >= 3)) + #define CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED 1 + #if (!defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_ENABLED) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ == 900) + #define CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_ENABLED 1 + + #if (!defined(CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90A_ENABLED) && defined(__CUDA_ARCH_FEAT_SM90_ALL)) + #define CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90A_ENABLED 1 + #endif + #endif +#endif + +///////////////////////////////////////////////////////////////////////////////////////////////// + +// SM90 F64 +#if (__CUDACC_VER_MAJOR__ > 11 || (__CUDACC_VER_MAJOR__ == 11 && __CUDACC_VER_MINOR__ >= 8)) + #define CUTLASS_ARCH_MMA_SM90_F64_MMA_SUPPORTED 1 + #if (!defined(CUTLASS_ARCH_MMA_SM90_F64_MMA_ENABLED) && defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900) + #define CUTLASS_ARCH_MMA_SM90_F64_MMA_ENABLED 1 + #endif +#endif + +///////////////////////////////////////////////////////////////////////////////////////////////// + diff --git a/include/cutlass/arch/grid_dependency_control.h b/include/cutlass/arch/grid_dependency_control.h new file mode 100644 index 0000000000..14ef197497 --- 
/dev/null
+++ b/include/cutlass/arch/grid_dependency_control.h
@@ -0,0 +1,84 @@
+/***************************************************************************************************
+ * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice, this
+ * list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ * and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its
+ * contributors may be used to endorse or promote products derived from
+ * this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+ * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+ * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+ * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **************************************************************************************************/
+
+/*! \file
+ \brief Grid dependent control (GDC) helpers for programmatic dependent launches (PDL).
+*/
+
+#pragma once
+
+#include "cute/arch/cluster_sm90.hpp"
+#include "cutlass/arch/barrier.h"
+#include "cutlass/conv/dispatch_policy.hpp"
+#include "cutlass/gemm/dispatch_policy.hpp"
+
+#ifndef CUTLASS_GDC_ENABLED
+ #if (defined(CUTLASS_ENABLE_GDC_FOR_SM90) && \
+ __CUDACC_VER_MAJOR__ >= 12 && \
+ defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900 && defined(__CUDA_ARCH_FEAT_SM90_ALL))
+ #define CUTLASS_GDC_ENABLED
+ #endif
+#endif
+
+namespace cutlass {
+namespace arch {
+
+// Issuing the launch_dependents instruction hints that a dependent kernel may launch earlier.
+// launch_dependents affects only performance, not functionality:
+// launching a dependent kernel too early makes it compete with the current kernel for resources,
+// while launching it too late adds launch latency.
+CUTLASS_DEVICE
+void launch_dependent_grids() {
+#if (defined(CUTLASS_GDC_ENABLED))
+ asm volatile("griddepcontrol.launch_dependents;");
+#endif
+}
+
+// Issuing the griddepcontrol.wait instruction enforces that no global memory access
+// occurs prior to this instruction. This ensures the correctness of global memory
+// accesses when a dependent kernel has been launched early.
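For orientation, a minimal PDL usage sketch built on the two helpers in this header; the kernels and the `compute()`/`preamble()` helpers are hypothetical, the consumer must be launched with CUDA's programmatic dependent launch attribute for any overlap to occur, and `wait_on_dependent_grids()` is the function defined just below:

```cpp
#include "cutlass/arch/grid_dependency_control.h"

__device__ float compute();   // hypothetical producer-side work
__device__ float preamble();  // hypothetical consumer-side work reading no producer data

__global__ void producer(float* out) {
  out[threadIdx.x] = compute();             // final global-memory writes
  cutlass::arch::launch_dependent_grids();  // hint: consumer grid may launch now
}

__global__ void consumer(float const* in, float* out) {
  float x = preamble();                        // overlap-safe preamble
  cutlass::arch::wait_on_dependent_grids();    // producer's writes now visible
  out[threadIdx.x] = in[threadIdx.x] + x;
}
```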
+CUTLASS_DEVICE +void wait_on_dependent_grids() { +#if (defined(CUTLASS_GDC_ENABLED)) + asm volatile("griddepcontrol.wait;"); +#endif +} + +// Enable kernel-level query regarding whether the GDC feature is turned on +#if (defined(CUTLASS_GDC_ENABLED)) +static constexpr bool IsGdcGloballyEnabled = true; +#else +static constexpr bool IsGdcGloballyEnabled = false; +#endif + + +} // namespace arch +} // namespace cutlass diff --git a/include/cutlass/arch/memory_sm80.h b/include/cutlass/arch/memory_sm80.h index acaa819567..cb0ba4b54b 100644 --- a/include/cutlass/arch/memory_sm80.h +++ b/include/cutlass/arch/memory_sm80.h @@ -326,6 +326,8 @@ struct cp_async { "cp.async only supports CacheOperation::Global when access size is 16B."); unsigned smem_int_ptr = cutlass_get_smem_pointer(smem_ptr); + cutlass::arch::synclog_emit_cp_async(__LINE__, smem_int_ptr, global_ptr, pred_guard, SizeInBytes); + asm volatile( "{\n" " .reg .pred p;\n" @@ -364,6 +366,8 @@ struct cp_async_zfill { unsigned smem_int_ptr = cutlass_get_smem_pointer(smem_ptr); int src_in_bytes = (pred_guard ? SizeInBytes : 0); + cutlass::arch::synclog_emit_cp_async_zfill(__LINE__, smem_int_ptr, global_ptr, pred_guard, SizeInBytes); + asm volatile( #if CUTLASS_ENABLE_L2_PREFETCH "cp.async.cg.shared.global.L2::128B [%0], [%1], %2, %3;\n" ::"r"(smem_int_ptr), @@ -401,6 +405,8 @@ struct cp_async_nan<16, CacheOperation::Global> { OOB_NAN_F16x2, OOB_NAN_F16x2}; unsigned smem_int_ptr = cutlass_get_smem_pointer(smem_ptr); + cutlass::arch::synclog_emit_cp_async_nan(__LINE__, smem_int_ptr, global_ptr, pred_guard); + asm volatile( "{\n" " .reg .pred p;\n" @@ -434,6 +440,7 @@ CUTLASS_DEVICE void cp_async_fence() { #if CUDA_CP_ASYNC_ACTIVATED asm volatile("cp.async.commit_group;\n" ::); + cutlass::arch::synclog_emit_cp_async_fence(__LINE__); #endif } @@ -444,6 +451,7 @@ template CUTLASS_DEVICE void cp_async_wait() { #if CUDA_CP_ASYNC_ACTIVATED asm volatile("cp.async.wait_group %0;\n" ::"n"(N)); + cutlass::arch::synclog_emit_cp_async_wait(__LINE__, N); #endif } @@ -452,6 +460,7 @@ template <> CUTLASS_DEVICE void cp_async_wait<0>() { #if CUDA_CP_ASYNC_ACTIVATED asm volatile("cp.async.wait_all;\n" ::); + cutlass::arch::synclog_emit_cp_async_wait_all(__LINE__); #endif } diff --git a/include/cutlass/arch/mma_sm90.h b/include/cutlass/arch/mma_sm90.h index d2b167a7ce..1183ee5e05 100644 --- a/include/cutlass/arch/mma_sm90.h +++ b/include/cutlass/arch/mma_sm90.h @@ -43,30 +43,7 @@ #include "mma.h" #include "cutlass/layout/matrix.h" #include "cutlass/numeric_types.h" - -//////////////////////////////////////////////////////////////////////////////// - -#if ((__CUDACC_VER_MAJOR__ > 11) || (__CUDACC_VER_MAJOR__ == 11 && __CUDACC_VER_MINOR__ >= 8)) - #define CUTLASS_ARCH_MMA_SM90_F64_MMA_SUPPORTED - #if (!defined(CUTLASS_ARCH_MMA_SM90_F64_MMA_ENABLED)) - #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) - #define CUTLASS_ARCH_MMA_SM90_F64_MMA_ENABLED - #endif - #endif -#endif - -#if (__CUDACC_VER_MAJOR__ >= 12) - #define CUTLASS_ARCH_MMA_SM90_SUPPORTED - #if (!defined(CUTLASS_ARCH_MMA_SM90_ENABLED)) - #if (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)) - #define CUTLASS_ARCH_MMA_SM90_ENABLED - #endif - #endif -#endif - -#if ((__CUDACC_VER_MAJOR__ > 12) || ((__CUDACC_VER_MAJOR__ == 12) && (__CUDACC_VER_MINOR__ >= 3))) - #define CUTLASS_ARCH_MMA_MODIFIABLE_TMA_SM90_SUPPORTED -#endif +#include "cutlass/arch/config.h" //////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/arch/reg_reconfig.h 
b/include/cutlass/arch/reg_reconfig.h index c1ffbeeb57..d2b434453e 100644 --- a/include/cutlass/arch/reg_reconfig.h +++ b/include/cutlass/arch/reg_reconfig.h @@ -37,9 +37,11 @@ #include "cutlass/cutlass.h" -#if (defined(__CUDA_ARCH__) &&\ - (__CUDA_ARCH__ >= 900) && (__CUDACC_VER_MAJOR__ >= 12) && defined(__CUDA_ARCH_FEAT_SM90_ALL)) +#ifndef CUDA_CTA_RECONFIG_ACTIVATED + #if (__CUDACC_VER_MAJOR__ >= 12 && \ + defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900 && defined(__CUDA_ARCH_FEAT_SM90_ALL)) #define CUDA_CTA_RECONFIG_ACTIVATED 1 + #endif #endif namespace cutlass { diff --git a/include/cutlass/arch/synclog.hpp b/include/cutlass/arch/synclog.hpp new file mode 100644 index 0000000000..ea683859a3 --- /dev/null +++ b/include/cutlass/arch/synclog.hpp @@ -0,0 +1,1324 @@ +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/*! \file + \brief Synchronization event logging for race condition debugging. 
+*/
+
+#pragma once
+
+#include "cutlass/detail/helper_macros.hpp"
+
+#if defined(__CUDACC_RTC__)
+#include <cuda/std/cstdint>
+#else
+#include <cstdint>
+#endif
+
+#if !defined(__CUDACC_RTC__)
+#include <mutex>
+#include <vector>
+#endif
+
+namespace cutlass {
+namespace arch {
+
+////////////////////////////////////////////////////////////////////////////////////////////////////
+
+#if defined(CUTLASS_ENABLE_SYNCLOG)
+
+constexpr uint32_t synclog_cap = 1 << 26;
+
+inline std::mutex synclog_mutex;
+inline std::vector<uint32_t*> synclog_buf_list;
+#if defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__))
+inline __device__ uint32_t* synclog_buf;
+#endif
+
+CUTLASS_DEVICE
+uint32_t* synclog_alloc(uint32_t n) {
+ #if defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__))
+ uint32_t* buf = synclog_buf;
+ if (buf == nullptr) return nullptr;
+ uint32_t last = atomicAdd(&buf[0], n);
+ if (last + n < synclog_cap) return buf + last + 1;
+ if (last >= synclog_cap) atomicAdd(&buf[0], -n);
+ #endif
+ return nullptr;
+}
+
+CUTLASS_DEVICE
+void synclog_emit_prefix(uint32_t* to, uint32_t header, uint32_t line) {
+ #if defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__))
+ uint64_t time64;
+ asm volatile (
+ "mov.u64 %0, %%globaltimer;\n"
+ : "=l"(time64) :
+ );
+ to[0] = header;
+ to[1] = line;
+ to[2] = time64;
+ to[3] = time64 >> 32;
+ to[4] = threadIdx.x;
+ to[5] = threadIdx.y;
+ to[6] = threadIdx.z;
+ to[7] = blockIdx.x;
+ to[8] = blockIdx.y;
+ to[9] = blockIdx.z;
+ #endif
+}
+
+constexpr uint32_t synclog_header_none = 0;
+constexpr uint32_t synclog_length_prefix = 1 + 1 + 2 + 3 + 3;
+
+constexpr bool synclog_enable_syncthreads = true;
+constexpr uint32_t synclog_header_syncthreads = 1;
+constexpr uint32_t synclog_length_syncthreads = synclog_length_prefix + 0;
+
+constexpr bool synclog_enable_syncwarp = true;
+constexpr uint32_t synclog_header_syncwarp = 2;
+constexpr uint32_t synclog_length_syncwarp = synclog_length_prefix + 0;
+
+constexpr bool synclog_enable_named_barrier_arrive_and_wait = true;
+constexpr uint32_t synclog_header_named_barrier_arrive_and_wait = 3;
+constexpr uint32_t synclog_length_named_barrier_arrive_and_wait = synclog_length_prefix + 2;
+
+constexpr bool synclog_enable_named_barrier_arrive = true;
+constexpr uint32_t synclog_header_named_barrier_arrive = 4;
+constexpr uint32_t synclog_length_named_barrier_arrive = synclog_length_prefix + 2;
+
+constexpr bool synclog_enable_cluster_barrier_init = true;
+constexpr uint32_t synclog_header_cluster_barrier_init = 5;
+constexpr uint32_t synclog_length_cluster_barrier_init = synclog_length_prefix + 2;
+
+constexpr bool synclog_enable_cluster_barrier_wait = true;
+constexpr uint32_t synclog_header_cluster_barrier_wait = 6;
+constexpr uint32_t synclog_length_cluster_barrier_wait = synclog_length_prefix + 4;
+
+constexpr bool synclog_enable_cluster_barrier_test_wait = true;
+constexpr uint32_t synclog_header_cluster_barrier_test_wait = 7;
+constexpr uint32_t synclog_length_cluster_barrier_test_wait = synclog_length_prefix + 5;
+
+constexpr bool synclog_enable_cluster_barrier_try_wait = true;
+constexpr uint32_t synclog_header_cluster_barrier_try_wait = 8;
+constexpr uint32_t synclog_length_cluster_barrier_try_wait = synclog_length_prefix + 4;
+
+constexpr bool synclog_enable_cluster_barrier_arrive_cluster = true;
+constexpr uint32_t synclog_header_cluster_barrier_arrive_cluster = 9;
+constexpr uint32_t synclog_length_cluster_barrier_arrive_cluster = synclog_length_prefix + 5;
+
+constexpr bool synclog_enable_cluster_barrier_arrive = true;
+constexpr uint32_t
synclog_header_cluster_barrier_arrive = 10; +constexpr uint32_t synclog_length_cluster_barrier_arrive = synclog_length_prefix + 3; + +constexpr bool synclog_enable_cluster_barrier_invalidate = true; +constexpr uint32_t synclog_header_cluster_barrier_invalidate = 11; +constexpr uint32_t synclog_length_cluster_barrier_invalidate = synclog_length_prefix + 3; + +constexpr bool synclog_enable_cluster_transaction_barrier_arrive_and_expect_tx = true; +constexpr uint32_t synclog_header_cluster_transaction_barrier_arrive_and_expect_tx = 12; +constexpr uint32_t synclog_length_cluster_transaction_barrier_arrive_and_expect_tx = synclog_length_prefix + 4; + +constexpr bool synclog_enable_cluster_transaction_barrier_arrive_and_expect_tx_cluster = true; +constexpr uint32_t synclog_header_cluster_transaction_barrier_arrive_and_expect_tx_cluster = 13; +constexpr uint32_t synclog_length_cluster_transaction_barrier_arrive_and_expect_tx_cluster = synclog_length_prefix + 6; + +constexpr bool synclog_enable_cluster_transaction_barrier_expect_transaction = true; +constexpr uint32_t synclog_header_cluster_transaction_barrier_expect_transaction = 14; +constexpr uint32_t synclog_length_cluster_transaction_barrier_expect_transaction = synclog_length_prefix + 4; + +constexpr bool synclog_enable_cluster_transaction_barrier_complete_transaction = true; +constexpr uint32_t synclog_header_cluster_transaction_barrier_complete_transaction = 15; +constexpr uint32_t synclog_length_cluster_transaction_barrier_complete_transaction = synclog_length_prefix + 6; + +constexpr bool synclog_enable_fence_barrier_init = true; +constexpr uint32_t synclog_header_fence_barrier_init = 16; +constexpr uint32_t synclog_length_fence_barrier_init = synclog_length_prefix + 0; + +constexpr bool synclog_enable_fence_view_async_shared = true; +constexpr uint32_t synclog_header_fence_view_async_shared = 17; +constexpr uint32_t synclog_length_fence_view_async_shared = synclog_length_prefix + 0; + +constexpr bool synclog_enable_cp_async_wait = true; +constexpr uint32_t synclog_header_cp_async_wait = 18; +constexpr uint32_t synclog_length_cp_async_wait = synclog_length_prefix + 1; + +constexpr bool synclog_enable_cp_async_wait_all = true; +constexpr uint32_t synclog_header_cp_async_wait_all = 19; +constexpr uint32_t synclog_length_cp_async_wait_all = synclog_length_prefix + 0; + +constexpr bool synclog_enable_cp_async_fence = true; +constexpr uint32_t synclog_header_cp_async_fence = 20; +constexpr uint32_t synclog_length_cp_async_fence = synclog_length_prefix + 0; + +constexpr bool synclog_enable_cp_async_nan = true; +constexpr uint32_t synclog_header_cp_async_nan = 21; +constexpr uint32_t synclog_length_cp_async_nan = synclog_length_prefix + 4; + +constexpr bool synclog_enable_cp_async_zfill = true; +constexpr uint32_t synclog_header_cp_async_zfill = 22; +constexpr uint32_t synclog_length_cp_async_zfill = synclog_length_prefix + 5; + +constexpr bool synclog_enable_cp_async = true; +constexpr uint32_t synclog_header_cp_async = 23; +constexpr uint32_t synclog_length_cp_async = synclog_length_prefix + 5; + +constexpr bool synclog_enable_tma_load = true; +constexpr uint32_t synclog_header_tma_load = 24; +constexpr uint32_t synclog_length_tma_load = synclog_length_prefix + 4; + +constexpr bool synclog_enable_tma_store = true; +constexpr uint32_t synclog_header_tma_store = 25; +constexpr uint32_t synclog_length_tma_store = synclog_length_prefix + 3; + +constexpr bool synclog_enable_tma_store_arrive = true; +constexpr uint32_t 
synclog_header_tma_store_arrive = 26; +constexpr uint32_t synclog_length_tma_store_arrive = synclog_length_prefix + 0; + +constexpr bool synclog_enable_tma_store_wait = true; +constexpr uint32_t synclog_header_tma_store_wait = 27; +constexpr uint32_t synclog_length_tma_store_wait = synclog_length_prefix + 1; + +constexpr bool synclog_enable_warpgroup_arrive = true; +constexpr uint32_t synclog_header_warpgroup_arrive = 28; +constexpr uint32_t synclog_length_warpgroup_arrive = synclog_length_prefix + 0; + +constexpr bool synclog_enable_warpgroup_wait = true; +constexpr uint32_t synclog_header_warpgroup_wait = 29; +constexpr uint32_t synclog_length_warpgroup_wait = synclog_length_prefix + 1; + +constexpr bool synclog_enable_warpgroup_commit_batch = true; +constexpr uint32_t synclog_header_warpgroup_commit_batch = 30; +constexpr uint32_t synclog_length_warpgroup_commit_batch = synclog_length_prefix + 0; + +constexpr bool synclog_enable_wgmma_reg_smem = true; +constexpr uint32_t synclog_header_wgmma_reg_smem = 31; +constexpr uint32_t synclog_length_wgmma_reg_smem = synclog_length_prefix + 2; + +constexpr bool synclog_enable_wgmma_smem_smem = true; +constexpr uint32_t synclog_header_wgmma_smem_smem = 32; +constexpr uint32_t synclog_length_wgmma_smem_smem = synclog_length_prefix + 4; + +constexpr bool synclog_enable_cpasync_barrier_arrive = true; +constexpr uint32_t synclog_header_cpasync_barrier_arrive = 33; +constexpr uint32_t synclog_length_cpasync_barrier_arrive = synclog_length_prefix + 3; + +CUTLASS_DEVICE +bool synclog_condition_emit() { + #if defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__)) + return threadIdx.x%NumThreadsPerWarp == 0 && threadIdx.y == 0 && threadIdx.z == 0 && + blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0; + #else + return 0; + #endif +} + +CUTLASS_DEVICE +bool synclog_condition_print() { + #if defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__)) + return threadIdx.x == 0 && threadIdx.y == 0 && threadIdx.z == 0 && + blockIdx.x == 0 && blockIdx.y == 0 && blockIdx.z == 0; + #else + return false; + #endif +} + +CUTLASS_DEVICE +void synclog_print_prefix(char const* header, uint32_t at) { + #if defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__)) + uint32_t line = synclog_buf[at + 1]; + uint32_t timeLo = synclog_buf[at + 2]; + uint32_t timeHi = synclog_buf[at + 3]; + uint32_t threadIdxX = synclog_buf[at + 4]; + uint32_t threadIdxY = synclog_buf[at + 5]; + uint32_t threadIdxZ = synclog_buf[at + 6]; + uint32_t blockIdxX = synclog_buf[at + 7]; + uint32_t blockIdxY = synclog_buf[at + 8]; + uint32_t blockIdxZ = synclog_buf[at + 9]; + printf( + "%s line=%u time=%lu thread=%u,%u,%u block=%u,%u,%u ", + header, line, + (uint64_t)timeHi << 32 | timeLo, + threadIdxX, threadIdxY, threadIdxZ, + blockIdxX, blockIdxY, blockIdxZ + ); + #endif +} + +CUTLASS_DEVICE +uint64_t synclog_mbarrier_bits(uint32_t smem_addr) { + uint64_t bits = 0; + asm volatile ( + "mbarrier.inval.shared::cta.b64 [%1];\n" + "ld.shared::cta.b64 %0, [%1];\n" + : "=l"(bits) : "r"(smem_addr) + ); + return bits; +} + +CUTLASS_DEVICE +void synclog_print_wgmma_desc(char const* str, uint32_t lo, uint32_t hi, char const* sep) { + CUTLASS_UNUSED(hi); + uint32_t smem_int_ptr = (lo & ((1 << 14) - 1)) << 4; + printf("%s_smem_int_ptr=%u%s", str, smem_int_ptr, sep); +} + +#endif // defined(CUTLASS_ENABLE_SYNCLOG) + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +inline void synclog_setup() { + #if defined(CUTLASS_ENABLE_SYNCLOG) + #if 
defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__)) + std::scoped_lock lock(synclog_mutex); + auto fail = [] () { + fprintf(stderr, "synclog_setup() failed\n"); + std::terminate(); + }; + int orig_device = 0; + if (cudaGetDevice(&orig_device) != cudaSuccess) { + fail(); + } + int device_count = 0; + if (cudaGetDeviceCount(&device_count) != cudaSuccess) { + fail(); + } + if (synclog_buf_list.size() == 0) { + for (int device = 0; device < device_count; device++) { + uint32_t* buf = 0; + if (cudaSetDevice(device) != cudaSuccess || + cudaMalloc(&buf, synclog_cap * sizeof(uint32_t)) != cudaSuccess) { + fail(); + } + synclog_buf_list.push_back(buf); + } + } + for (int device = 0; device < device_count; device++) { + uint32_t* buf = synclog_buf_list.at(device); + if (cudaSetDevice(device) != cudaSuccess || + cudaMemset(buf, 0, synclog_cap * sizeof(uint32_t)) != cudaSuccess || + cudaMemcpyToSymbol(synclog_buf, &buf, sizeof(buf)) != cudaSuccess) { + fail(); + } + } + if (cudaSetDevice(orig_device) != cudaSuccess) { + fail(); + } + #endif + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_syncthreads(uint32_t line) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_syncthreads) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_syncthreads); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_syncthreads, line); + #else + CUTLASS_UNUSED(line); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_syncwarp(uint32_t line) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_syncwarp) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_syncwarp); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_syncwarp, line); + #else + CUTLASS_UNUSED(line); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_named_barrier_arrive_and_wait( + uint32_t line, + uint32_t num_threads, + uint32_t barrier_id) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_named_barrier_arrive_and_wait) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_named_barrier_arrive_and_wait); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_named_barrier_arrive_and_wait, line); + to[synclog_length_prefix + 0] = num_threads; + to[synclog_length_prefix + 1] = barrier_id; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(num_threads); + CUTLASS_UNUSED(barrier_id); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_named_barrier_arrive( + uint32_t line, + uint32_t num_threads, + uint32_t barrier_id) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_named_barrier_arrive) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_named_barrier_arrive); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_named_barrier_arrive, line); + to[synclog_length_prefix + 0] = num_threads; + to[synclog_length_prefix + 1] = barrier_id; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(num_threads); + CUTLASS_UNUSED(barrier_id); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cluster_barrier_init( + uint32_t line, + uint32_t smem_addr, + uint32_t arrive_count) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cluster_barrier_init) return; + if 
(!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_cluster_barrier_init); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cluster_barrier_init, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = arrive_count; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + CUTLASS_UNUSED(arrive_count); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cluster_barrier_wait( + uint32_t line, + uint32_t smem_addr, + uint32_t phase) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cluster_barrier_wait) return; + if (!synclog_condition_emit()) return; + uint64_t bits = synclog_mbarrier_bits(smem_addr); + uint32_t* to = synclog_alloc(synclog_length_cluster_barrier_wait); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cluster_barrier_wait, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = phase; + to[synclog_length_prefix + 2] = bits; + to[synclog_length_prefix + 3] = bits >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + CUTLASS_UNUSED(phase); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cluster_barrier_test_wait( + uint32_t line, + uint32_t smem_addr, + uint32_t phase, + uint32_t pred) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cluster_barrier_test_wait) return; + if (!synclog_condition_emit()) return; + uint64_t bits = synclog_mbarrier_bits(smem_addr); + uint32_t* to = synclog_alloc(synclog_length_cluster_barrier_test_wait); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cluster_barrier_test_wait, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = phase; + to[synclog_length_prefix + 2] = pred; + to[synclog_length_prefix + 3] = bits; + to[synclog_length_prefix + 4] = bits >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + CUTLASS_UNUSED(phase); + CUTLASS_UNUSED(pred); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cluster_barrier_try_wait( + uint32_t line, + uint32_t smem_addr, + uint32_t phase) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cluster_barrier_try_wait) return; + if (!synclog_condition_emit()) return; + uint64_t bits = synclog_mbarrier_bits(smem_addr); + uint32_t* to = synclog_alloc(synclog_length_cluster_barrier_try_wait); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cluster_barrier_try_wait, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = phase; + to[synclog_length_prefix + 2] = bits; + to[synclog_length_prefix + 3] = bits >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + CUTLASS_UNUSED(phase); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cluster_barrier_arrive_cluster( + uint32_t line, + uint32_t smem_addr, + uint32_t cta_id, + uint32_t pred) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cluster_barrier_arrive_cluster) return; + if (!synclog_condition_emit()) return; + uint64_t bits = synclog_mbarrier_bits(smem_addr); + uint32_t* to = synclog_alloc(synclog_length_cluster_barrier_arrive_cluster); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cluster_barrier_arrive_cluster, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = cta_id; + 
to[synclog_length_prefix + 2] = pred; + to[synclog_length_prefix + 3] = bits; + to[synclog_length_prefix + 4] = bits >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + CUTLASS_UNUSED(cta_id); + CUTLASS_UNUSED(pred); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cluster_barrier_arrive( + uint32_t line, + uint32_t smem_addr) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cluster_barrier_arrive) return; + if (!synclog_condition_emit()) return; + uint64_t bits = synclog_mbarrier_bits(smem_addr); + uint32_t* to = synclog_alloc(synclog_length_cluster_barrier_arrive); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cluster_barrier_arrive, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = bits; + to[synclog_length_prefix + 2] = bits >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cluster_barrier_invalidate( + uint32_t line, + uint32_t smem_addr) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cluster_barrier_invalidate) return; + if (!synclog_condition_emit()) return; + uint64_t bits = synclog_mbarrier_bits(smem_addr); + uint32_t* to = synclog_alloc(synclog_length_cluster_barrier_invalidate); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cluster_barrier_invalidate, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = bits; + to[synclog_length_prefix + 2] = bits >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cluster_transaction_barrier_arrive_and_expect_tx( + uint32_t line, + uint32_t smem_addr, + uint32_t transaction_bytes) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cluster_transaction_barrier_arrive_and_expect_tx) return; + if (!synclog_condition_emit()) return; + uint64_t bits = synclog_mbarrier_bits(smem_addr); + uint32_t* to = synclog_alloc(synclog_length_cluster_transaction_barrier_arrive_and_expect_tx); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cluster_transaction_barrier_arrive_and_expect_tx, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = transaction_bytes; + to[synclog_length_prefix + 2] = bits; + to[synclog_length_prefix + 3] = bits >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + CUTLASS_UNUSED(transaction_bytes); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cluster_transaction_barrier_arrive_and_expect_tx_cluster( + uint32_t line, + uint32_t smem_addr, + uint32_t transaction_bytes, + uint32_t cta_id, + uint32_t pred) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cluster_transaction_barrier_arrive_and_expect_tx_cluster) return; + if (!synclog_condition_emit()) return; + uint64_t bits = synclog_mbarrier_bits(smem_addr); + uint32_t* to = synclog_alloc(synclog_length_cluster_transaction_barrier_arrive_and_expect_tx_cluster); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cluster_transaction_barrier_arrive_and_expect_tx_cluster, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = transaction_bytes; + to[synclog_length_prefix + 2] = cta_id; + to[synclog_length_prefix + 3] = pred; + to[synclog_length_prefix + 4] = bits; + 
to[synclog_length_prefix + 5] = bits >> 32;
+ #else
+ CUTLASS_UNUSED(line);
+ CUTLASS_UNUSED(smem_addr);
+ CUTLASS_UNUSED(transaction_bytes);
+ CUTLASS_UNUSED(cta_id);
+ CUTLASS_UNUSED(pred);
+ #endif // defined(CUTLASS_ENABLE_SYNCLOG)
+}
+
+CUTLASS_DEVICE
+void synclog_emit_cluster_transaction_barrier_expect_transaction(
+ uint32_t line,
+ uint32_t smem_addr,
+ uint32_t transaction_bytes) {
+ #if defined(CUTLASS_ENABLE_SYNCLOG)
+ if constexpr (!synclog_enable_cluster_transaction_barrier_expect_transaction) return;
+ if (!synclog_condition_emit()) return;
+ uint64_t bits = synclog_mbarrier_bits(smem_addr);
+ uint32_t* to = synclog_alloc(synclog_length_cluster_transaction_barrier_expect_transaction);
+ if (to == nullptr) return;
+ synclog_emit_prefix(to, synclog_header_cluster_transaction_barrier_expect_transaction, line);
+ to[synclog_length_prefix + 0] = smem_addr;
+ to[synclog_length_prefix + 1] = transaction_bytes;
+ to[synclog_length_prefix + 2] = bits;
+ to[synclog_length_prefix + 3] = bits >> 32;
+ #else
+ CUTLASS_UNUSED(line);
+ CUTLASS_UNUSED(smem_addr);
+ CUTLASS_UNUSED(transaction_bytes);
+ #endif // defined(CUTLASS_ENABLE_SYNCLOG)
+}
+
+CUTLASS_DEVICE
+void synclog_emit_cluster_transaction_barrier_complete_transaction(
+ uint32_t line,
+ uint32_t smem_addr,
+ uint32_t dst_cta_id,
+ uint32_t transaction_bytes,
+ uint32_t pred) {
+ #if defined(CUTLASS_ENABLE_SYNCLOG)
+ if constexpr (!synclog_enable_cluster_transaction_barrier_complete_transaction) return;
+ if (!synclog_condition_emit()) return;
+ uint64_t bits = synclog_mbarrier_bits(smem_addr);
+ uint32_t* to = synclog_alloc(synclog_length_cluster_transaction_barrier_complete_transaction);
+ if (to == nullptr) return;
+ synclog_emit_prefix(to, synclog_header_cluster_transaction_barrier_complete_transaction, line);
+ to[synclog_length_prefix + 0] = smem_addr;
+ to[synclog_length_prefix + 1] = dst_cta_id;
+ to[synclog_length_prefix + 2] = transaction_bytes;
+ to[synclog_length_prefix + 3] = pred;
+ to[synclog_length_prefix + 4] = bits;
+ to[synclog_length_prefix + 5] = bits >> 32;
+ #else
+ CUTLASS_UNUSED(line);
+ CUTLASS_UNUSED(smem_addr);
+ CUTLASS_UNUSED(dst_cta_id);
+ CUTLASS_UNUSED(transaction_bytes);
+ CUTLASS_UNUSED(pred);
+ #endif // defined(CUTLASS_ENABLE_SYNCLOG)
+}
+
+CUTLASS_DEVICE
+void synclog_emit_fence_barrier_init(uint32_t line) {
+ #if defined(CUTLASS_ENABLE_SYNCLOG)
+ if constexpr (!synclog_enable_fence_barrier_init) return;
+ if (!synclog_condition_emit()) return;
+ uint32_t* to = synclog_alloc(synclog_length_fence_barrier_init);
+ if (to == nullptr) return;
+ synclog_emit_prefix(to, synclog_header_fence_barrier_init, line);
+ #else
+ CUTLASS_UNUSED(line);
+ #endif // defined(CUTLASS_ENABLE_SYNCLOG)
+}
+
+CUTLASS_DEVICE
+void synclog_emit_fence_view_async_shared(uint32_t line) {
+ #if defined(CUTLASS_ENABLE_SYNCLOG)
+ if constexpr (!synclog_enable_fence_view_async_shared) return;
+ if (!synclog_condition_emit()) return;
+ uint32_t* to = synclog_alloc(synclog_length_fence_view_async_shared);
+ if (to == nullptr) return;
+ synclog_emit_prefix(to, synclog_header_fence_view_async_shared, line);
+ #else
+ CUTLASS_UNUSED(line);
+ #endif // defined(CUTLASS_ENABLE_SYNCLOG)
+}
+
+CUTLASS_DEVICE
+void synclog_emit_cp_async_wait(
+ uint32_t line,
+ uint32_t n) {
+ #if defined(CUTLASS_ENABLE_SYNCLOG)
+ if constexpr (!synclog_enable_cp_async_wait) return;
+ if (!synclog_condition_emit()) return;
+ uint32_t* to = synclog_alloc(synclog_length_cp_async_wait);
+ if (to == nullptr) return;
+ synclog_emit_prefix(to,
synclog_header_cp_async_wait, line); + to[synclog_length_prefix + 0] = n; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(n); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cp_async_wait_all(uint32_t line) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cp_async_wait_all) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_cp_async_wait_all); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cp_async_wait_all, line); + #else + CUTLASS_UNUSED(line); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cp_async_fence(uint32_t line) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cp_async_fence) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_cp_async_fence); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cp_async_fence, line); + #else + CUTLASS_UNUSED(line); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cp_async_nan( + uint32_t line, + uint32_t smem_addr, + const void* gmem_ptr, + uint32_t pred) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cp_async_nan) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_cp_async_nan); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cp_async_nan, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = (uint32_t)((uint64_t)gmem_ptr); + to[synclog_length_prefix + 2] = (uint32_t)((uint64_t)gmem_ptr >> 32); + to[synclog_length_prefix + 3] = pred; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + CUTLASS_UNUSED(gmem_ptr); + CUTLASS_UNUSED(pred); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cp_async_zfill( + uint32_t line, + uint32_t smem_addr, + const void* gmem_ptr, + uint32_t pred, + uint32_t size) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cp_async_zfill) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_cp_async_zfill); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cp_async_zfill, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = (uint32_t)((uint64_t)gmem_ptr); + to[synclog_length_prefix + 2] = (uint32_t)((uint64_t)gmem_ptr >> 32); + to[synclog_length_prefix + 3] = pred; + to[synclog_length_prefix + 4] = size; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + CUTLASS_UNUSED(gmem_ptr); + CUTLASS_UNUSED(pred); + CUTLASS_UNUSED(size); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cp_async( + uint32_t line, + uint32_t smem_addr, + const void* gmem_ptr, + uint32_t pred, + uint32_t size) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cp_async) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_cp_async); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cp_async, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = (uint32_t)((uint64_t)gmem_ptr); + to[synclog_length_prefix + 2] = (uint32_t)((uint64_t)gmem_ptr >> 32); + to[synclog_length_prefix + 3] = pred; + to[synclog_length_prefix + 4] = size; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + CUTLASS_UNUSED(gmem_ptr); + CUTLASS_UNUSED(pred); + 
CUTLASS_UNUSED(size); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_tma_load( + uint32_t line, + uint64_t gmem_int_desc, + uint32_t smem_int_mbar, + uint32_t smem_int_ptr) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_tma_load) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_tma_load); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_tma_load, line); + to[synclog_length_prefix + 0] = (uint32_t)((uint64_t)gmem_int_desc); + to[synclog_length_prefix + 1] = (uint32_t)((uint64_t)gmem_int_desc >> 32); + to[synclog_length_prefix + 2] = smem_int_mbar; + to[synclog_length_prefix + 3] = smem_int_ptr; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(gmem_int_desc); + CUTLASS_UNUSED(smem_int_mbar); + CUTLASS_UNUSED(smem_int_ptr); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_tma_store( + uint32_t line, + uint64_t gmem_int_desc, + uint32_t smem_int_ptr) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_tma_store) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_tma_store); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_tma_store, line); + to[synclog_length_prefix + 0] = (uint32_t)((uint64_t)gmem_int_desc); + to[synclog_length_prefix + 1] = (uint32_t)((uint64_t)gmem_int_desc >> 32); + to[synclog_length_prefix + 2] = smem_int_ptr; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(gmem_int_desc); + CUTLASS_UNUSED(smem_int_ptr); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_tma_store_arrive(uint32_t line) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_tma_store_arrive) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_tma_store_arrive); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_tma_store_arrive, line); + #else + CUTLASS_UNUSED(line); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_tma_store_wait( + uint32_t line, + uint32_t count) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_tma_store_wait) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_tma_store_wait); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_tma_store_wait, line); + to[synclog_length_prefix + 0] = count; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(count); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_warpgroup_arrive( + uint32_t line) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_warpgroup_arrive) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_warpgroup_arrive); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_warpgroup_arrive, line); + #else + CUTLASS_UNUSED(line); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_warpgroup_wait( + uint32_t line, + uint32_t n) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_warpgroup_wait) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_warpgroup_wait); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_warpgroup_wait, line); + to[synclog_length_prefix + 0] = n; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(n); + #endif // 
defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_warpgroup_commit_batch( + uint32_t line) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_warpgroup_commit_batch) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_warpgroup_commit_batch); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_warpgroup_commit_batch, line); + #else + CUTLASS_UNUSED(line); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_wgmma_reg_smem( + uint32_t line, + uint64_t desc_b) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_wgmma_reg_smem) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_wgmma_reg_smem); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_wgmma_reg_smem, line); + to[synclog_length_prefix + 0] = desc_b; + to[synclog_length_prefix + 1] = desc_b >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(desc_b); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_wgmma_smem_smem( + uint32_t line, + uint64_t desc_a, + uint64_t desc_b) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_wgmma_smem_smem) return; + if (!synclog_condition_emit()) return; + uint32_t* to = synclog_alloc(synclog_length_wgmma_smem_smem); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_wgmma_smem_smem, line); + to[synclog_length_prefix + 0] = desc_a; + to[synclog_length_prefix + 1] = desc_a >> 32; + to[synclog_length_prefix + 2] = desc_b; + to[synclog_length_prefix + 3] = desc_b >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(desc_a); + CUTLASS_UNUSED(desc_b); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +CUTLASS_DEVICE +void synclog_emit_cpasync_barrier_arrive( + uint32_t line, + uint32_t smem_addr) { + #if defined(CUTLASS_ENABLE_SYNCLOG) + if constexpr (!synclog_enable_cpasync_barrier_arrive) return; + if (!synclog_condition_emit()) return; + uint64_t bits = synclog_mbarrier_bits(smem_addr); + uint32_t* to = synclog_alloc(synclog_length_cpasync_barrier_arrive); + if (to == nullptr) return; + synclog_emit_prefix(to, synclog_header_cpasync_barrier_arrive, line); + to[synclog_length_prefix + 0] = smem_addr; + to[synclog_length_prefix + 1] = bits; + to[synclog_length_prefix + 2] = bits >> 32; + #else + CUTLASS_UNUSED(line); + CUTLASS_UNUSED(smem_addr); + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +#if !defined(CUTLASS_ENABLE_SYNCLOG) +CUTLASS_DEVICE +#elif defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__)) +static __attribute__((__noinline__)) __device__ +#else +static __attribute__((__noinline__)) +#endif +void synclog_print() { + #if defined(CUTLASS_ENABLE_SYNCLOG) + #if defined(__NVCC__) || (defined(__clang__) && defined(__CUDA__)) + if (synclog_buf == nullptr || !synclog_condition_print()) { + return; + } + printf("synclog start\n"); + for (uint32_t at = 1; at < synclog_cap; ) { + uint32_t header = synclog_buf[at]; + if (header == synclog_header_none) { + break; + } + printf("synclog at %u: ", at); + if constexpr (synclog_enable_syncthreads) { + if (header == synclog_header_syncthreads) { + synclog_print_prefix("syncthreads", at); + at += synclog_length_syncthreads; + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_syncwarp) { + if (header == synclog_header_syncwarp) { + synclog_print_prefix("syncwarp", at); + at += synclog_length_syncwarp; + printf("\n"); + continue; + } + } + if 
constexpr (synclog_enable_named_barrier_arrive_and_wait) { + if (header == synclog_header_named_barrier_arrive_and_wait) { + synclog_print_prefix("named_barrier_arrive_and_wait", at); + at += synclog_length_named_barrier_arrive_and_wait; + printf("num_threads=%u barrier_id=%u\n", synclog_buf[at-2], synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_named_barrier_arrive) { + if (header == synclog_header_named_barrier_arrive) { + synclog_print_prefix("named_barrier_arrive", at); + at += synclog_length_named_barrier_arrive; + printf("num_threads=%u barrier_id=%u\n", synclog_buf[at-2], synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_cluster_barrier_init) { + if (header == synclog_header_cluster_barrier_init) { + synclog_print_prefix("cluster_barrier_init", at); + at += synclog_length_cluster_barrier_init; + printf("smem_addr=%u arrive_count=%u\n", synclog_buf[at-2], synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_cluster_barrier_wait) { + if (header == synclog_header_cluster_barrier_wait) { + synclog_print_prefix("cluster_barrier_wait", at); + at += synclog_length_cluster_barrier_wait; + printf("smem_addr=%u phase=%u", synclog_buf[at-4], synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_cluster_barrier_test_wait) { + if (header == synclog_header_cluster_barrier_test_wait) { + synclog_print_prefix("cluster_barrier_test_wait", at); + at += synclog_length_cluster_barrier_test_wait; + printf("smem_addr=%u phase=%u pred=%u", synclog_buf[at-5], synclog_buf[at-4], synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_cluster_barrier_try_wait) { + if (header == synclog_header_cluster_barrier_try_wait) { + synclog_print_prefix("cluster_barrier_try_wait", at); + at += synclog_length_cluster_barrier_try_wait; + printf("smem_addr=%u phase=%u", synclog_buf[at-4], synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_cluster_barrier_arrive_cluster) { + if (header == synclog_header_cluster_barrier_arrive_cluster) { + synclog_print_prefix("cluster_barrier_arrive_cluster", at); + at += synclog_length_cluster_barrier_arrive_cluster; + printf("smem_addr=%u cta_id=%u pred=%u", synclog_buf[at-5], synclog_buf[at-4], synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_cluster_barrier_arrive) { + if (header == synclog_header_cluster_barrier_arrive) { + synclog_print_prefix("cluster_barrier_arrive", at); + at += synclog_length_cluster_barrier_arrive; + printf("smem_addr=%u", synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_cluster_barrier_invalidate) { + if (header == synclog_header_cluster_barrier_invalidate) { + synclog_print_prefix("cluster_barrier_invalidate", at); + at += synclog_length_cluster_barrier_invalidate; + printf("smem_addr=%u", synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_cluster_transaction_barrier_arrive_and_expect_tx) { + if (header == synclog_header_cluster_transaction_barrier_arrive_and_expect_tx) { + synclog_print_prefix("cluster_transaction_barrier_arrive_and_expect_tx", at); + at += synclog_length_cluster_transaction_barrier_arrive_and_expect_tx; + printf("smem_addr=%u transaction_bytes=%u", synclog_buf[at-4], synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_cluster_transaction_barrier_arrive_and_expect_tx_cluster) { + if (header == synclog_header_cluster_transaction_barrier_arrive_and_expect_tx_cluster) { + synclog_print_prefix("cluster_transaction_barrier_arrive_and_expect_tx_cluster", 
at); + at += synclog_length_cluster_transaction_barrier_arrive_and_expect_tx_cluster; + printf("smem_addr=%u transaction_bytes=%u cta_id=%u pred=%u", synclog_buf[at-6], synclog_buf[at-5], synclog_buf[at-4], synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_cluster_transaction_barrier_expect_transaction) { + if (header == synclog_header_cluster_transaction_barrier_expect_transaction) { + synclog_print_prefix("cluster_transaction_barrier_expect_transaction", at); + at += synclog_length_cluster_transaction_barrier_expect_transaction; + printf("smem_addr=%u transaction_bytes=%u", synclog_buf[at-4], synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_cluster_transaction_barrier_complete_transaction) { + if (header == synclog_header_cluster_transaction_barrier_complete_transaction) { + synclog_print_prefix("cluster_transaction_barrier_complete_transaction", at); + at += synclog_length_cluster_transaction_barrier_complete_transaction; + printf("smem_addr=%u dst_cta_id=%u transaction_bytes=%u pred=%u", synclog_buf[at-6], synclog_buf[at-5], synclog_buf[at-4], synclog_buf[at-3]); + continue; + } + } + if constexpr (synclog_enable_fence_barrier_init) { + if (header == synclog_header_fence_barrier_init) { + synclog_print_prefix("fence_barrier_init", at); + at += synclog_length_fence_barrier_init; + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_fence_view_async_shared) { + if (header == synclog_header_fence_view_async_shared) { + synclog_print_prefix("fence_view_async_shared", at); + at += synclog_length_fence_view_async_shared; + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_cp_async_wait) { + if (header == synclog_header_cp_async_wait) { + synclog_print_prefix("cp_async_wait", at); + at += synclog_length_cp_async_wait; + printf("n=%u\n", synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_cp_async_wait_all) { + if (header == synclog_header_cp_async_wait_all) { + synclog_print_prefix("cp_async_wait_all", at); + at += synclog_length_cp_async_wait_all; + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_cp_async_fence) { + if (header == synclog_header_cp_async_fence) { + synclog_print_prefix("cp_async_fence", at); + at += synclog_length_cp_async_fence; + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_cp_async_nan) { + if (header == synclog_header_cp_async_nan) { + synclog_print_prefix("cp_async_nan", at); + at += synclog_length_cp_async_nan; + uint64_t gmem_addr = synclog_buf[at-3]; + gmem_addr += (uint64_t)synclog_buf[at-2] << 32; + printf("smem_addr=%u gmem_addr=%llu pred=%u\n", synclog_buf[at-4], gmem_addr, synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_cp_async_zfill) { + if (header == synclog_header_cp_async_zfill) { + synclog_print_prefix("cp_async_zfill", at); + at += synclog_length_cp_async_zfill; + uint64_t gmem_addr = synclog_buf[at-4]; + gmem_addr += (uint64_t)synclog_buf[at-3] << 32; + printf("smem_addr=%u gmem_addr=%llu pred=%u size=%u\n", synclog_buf[at-5], gmem_addr, synclog_buf[at-2], synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_cp_async) { + if (header == synclog_header_cp_async) { + synclog_print_prefix("cp_async", at); + at += synclog_length_cp_async; + uint64_t gmem_addr = synclog_buf[at-4]; + gmem_addr += (uint64_t)synclog_buf[at-3] << 32; + printf("smem_addr=%u gmem_addr=%llu pred=%u size=%u\n", synclog_buf[at-5], gmem_addr, synclog_buf[at-2], synclog_buf[at-1]); + continue; + } + } + if constexpr 
(synclog_enable_tma_load) { + if (header == synclog_header_tma_load) { + synclog_print_prefix("tma_load", at); + at += synclog_length_tma_load; + uint64_t gmem_int_desc = synclog_buf[at-4]; + gmem_int_desc += (uint64_t)synclog_buf[at-3] << 32; + printf("gmem_int_desc=%llu smem_int_mbar=%u smem_int_ptr=%u\n", gmem_int_desc, synclog_buf[at-2], synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_tma_store) { + if (header == synclog_header_tma_store) { + synclog_print_prefix("tma_store", at); + at += synclog_length_tma_store; + uint64_t gmem_int_desc = synclog_buf[at-3]; + gmem_int_desc += (uint64_t)synclog_buf[at-2] << 32; + printf("gmem_int_desc=%llu smem_int_ptr=%u\n", gmem_int_desc, synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_tma_store_arrive) { + if (header == synclog_header_tma_store_arrive) { + synclog_print_prefix("tma_store_arrive", at); + at += synclog_length_tma_store_arrive; + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_tma_store_wait) { + if (header == synclog_header_tma_store_wait) { + synclog_print_prefix("tma_store_wait", at); + at += synclog_length_tma_store_wait; + printf("count=%u\n", synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_warpgroup_arrive) { + if (header == synclog_header_warpgroup_arrive) { + synclog_print_prefix("warpgroup_arrive", at); + at += synclog_length_warpgroup_arrive; + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_warpgroup_wait) { + if (header == synclog_header_warpgroup_wait) { + synclog_print_prefix("warpgroup_wait", at); + at += synclog_length_warpgroup_wait; + printf("n=%u\n", synclog_buf[at-1]); + continue; + } + } + if constexpr (synclog_enable_warpgroup_commit_batch) { + if (header == synclog_header_warpgroup_commit_batch) { + synclog_print_prefix("warpgroup_commit_batch", at); + at += synclog_length_warpgroup_commit_batch; + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_wgmma_reg_smem) { + if (header == synclog_header_wgmma_reg_smem) { + synclog_print_prefix("wgmma_reg_smem", at); + at += synclog_length_wgmma_reg_smem; + synclog_print_wgmma_desc("desc_b", synclog_buf[at-2], synclog_buf[at-1], ""); + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_wgmma_smem_smem) { + if (header == synclog_header_wgmma_smem_smem) { + synclog_print_prefix("wgmma_smem_smem", at); + at += synclog_length_wgmma_smem_smem; + synclog_print_wgmma_desc("desc_a", synclog_buf[at-4], synclog_buf[at-3], " "); + synclog_print_wgmma_desc("desc_b", synclog_buf[at-2], synclog_buf[at-1], ""); + printf("\n"); + continue; + } + } + if constexpr (synclog_enable_cpasync_barrier_arrive) { + if (header == synclog_header_cpasync_barrier_arrive) { + synclog_print_prefix("cpasync_barrier_arrive", at); + at += synclog_length_cpasync_barrier_arrive; + printf("smem_addr=%u", synclog_buf[at-3]); + continue; + } + } + asm volatile ("brkpt;\n" ::); + } + if (synclog_buf[0] >= synclog_cap) { + printf( + "synclog was truncated (exceeded capacity of %lu bytes)\n", + (synclog_cap - 1) * sizeof(uint32_t) + ); + } + printf("synclog end\n"); + #endif + #endif // defined(CUTLASS_ENABLE_SYNCLOG) +} + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +#if defined(CUTLASS_ENABLE_SYNCLOG) +#undef __syncthreads +#define __syncthreads() do {\ + cutlass::arch::synclog_emit_syncthreads(__LINE__);\ + __syncthreads();\ +} while (0) +#endif // defined(CUTLASS_ENABLE_SYNCLOG) + +#if defined(CUTLASS_ENABLE_SYNCLOG) 
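End to end, the tool above is driven from both sides of a launch: the host calls `synclog_setup()` before launching kernels built with `CUTLASS_ENABLE_SYNCLOG` defined, and the kernel calls `synclog_print()` at its very end (only thread (0,0,0) of block (0,0,0) actually prints). A minimal sketch, with a hypothetical kernel and an illustrative launch shape:

```cpp
#include <cuda_runtime.h>
#include "cutlass/arch/synclog.hpp"

__global__ void my_kernel() {
  // ... work that goes through the instrumented barrier/copy primitives ...
  cutlass::arch::synclog_print();   // dump the accumulated log via device-side printf
}

int main() {
  cutlass::arch::synclog_setup();   // allocate and zero the per-device log buffers
  my_kernel<<<1, 128>>>();
  cudaDeviceSynchronize();          // flush device-side printf output
  return 0;
}
```

Note that the `__syncthreads` wrapper just above and the `__syncwarp` wrapper just below rely on the preprocessor's rule against recursive expansion: the intrinsic's name inside the macro body is not expanded again, so it binds to the CUDA built-in rather than the macro.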
+#undef __syncwarp +#define __syncwarp(...) do {\ + cutlass::arch::synclog_emit_syncwarp(__LINE__);\ + __syncwarp(__VA_ARGS__);\ +} while (0) +#endif // defined(CUTLASS_ENABLE_SYNCLOG) + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +} // namespace arch +} // namespace cutlass diff --git a/include/cutlass/array.h b/include/cutlass/array.h index 499d45c724..62e9469497 100644 --- a/include/cutlass/array.h +++ b/include/cutlass/array.h @@ -37,6 +37,7 @@ #include "cutlass/cutlass.h" #include "cutlass/functional.h" #include "cutlass/numeric_types.h" +#include "cutlass/platform/platform.h" namespace cutlass { //////////////////////////////////////////////////////////////////////////////////////////////////// @@ -49,6 +50,23 @@ template < > struct Array; +namespace detail { + +template +struct is_Array : platform::false_type {}; + +template < + typename T, + int N, + bool RegisterSized +> +struct is_Array > : platform::true_type {}; + +template +constexpr bool is_Array_v = is_Array::value; + +} // namespace detail + //////////////////////////////////////////////////////////////////////////////////////////////////// /// Defines the size of an Array<> in bits @@ -803,111 +821,14 @@ struct reciprocal_approximate_ftz> { } }; -template -struct maximum, false> { - - CUTLASS_HOST_DEVICE - Array operator()(Array const &lhs, Array const &rhs) const { - - Array result; - maximum scalar_op; - - CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < N; ++i) { - result[i] = scalar_op(lhs[i], rhs[i]); - } - - return result; - } - - CUTLASS_HOST_DEVICE - Array operator()(Array const &lhs, T const &scalar) const { - - Array result; - maximum scalar_op; - - CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < N; ++i) { - result[i] = scalar_op(lhs[i], scalar); - } - - return result; - } - - CUTLASS_HOST_DEVICE - Array operator()(T const &scalar, Array const &rhs) const { - - Array result; - maximum scalar_op; - - CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < N; ++i) { - result[i] = scalar_op(scalar, rhs[i]); - } - - return result; - } -}; - -template -struct maximum, true> { - - CUTLASS_HOST_DEVICE - Array operator()(Array const &lhs, Array const &rhs) const { - - Array result; - maximum scalar_op; - - CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < N; ++i) { - result[i] = scalar_op(lhs[i], rhs[i]); - } - - return result; - } - - CUTLASS_HOST_DEVICE - Array operator()(Array const &lhs, T const &scalar) const { - - Array result; - maximum scalar_op; - - CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < N; ++i) { - result[i] = scalar_op(lhs[i], scalar); - } - - return result; - } - - CUTLASS_HOST_DEVICE - Array operator()(T const &scalar, Array const &rhs) const { - - Array result; - maximum scalar_op; - - CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < N; ++i) { - result[i] = scalar_op(scalar, rhs[i]); - } - - return result; - } -}; - -template -struct minimum, false> { - - CUTLASS_HOST_DEVICE - static T scalar_op(T const &lhs, T const &rhs) { - return (rhs < lhs ? 
rhs : lhs); - } +template +struct maximum, PropagateNaN> { CUTLASS_HOST_DEVICE Array operator()(Array const &lhs, Array const &rhs) const { Array result; - minimum scalar_op; + maximum scalar_op; CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { @@ -921,7 +842,7 @@ struct minimum, false> { Array operator()(Array const &lhs, T const &scalar) const { Array result; - minimum scalar_op; + maximum scalar_op; CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { @@ -935,7 +856,7 @@ struct minimum, false> { Array operator()(T const &scalar, Array const &rhs) const { Array result; - minimum scalar_op; + maximum scalar_op; CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { @@ -946,8 +867,8 @@ struct minimum, false> { } }; -template -struct minimum, true> { +template +struct minimum, PropagateNaN> { CUTLASS_HOST_DEVICE static T scalar_op(T const &lhs, T const &rhs) { @@ -958,7 +879,7 @@ struct minimum, true> { Array operator()(Array const &lhs, Array const &rhs) const { Array result; - minimum scalar_op; + minimum scalar_op; CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { @@ -972,7 +893,7 @@ struct minimum, true> { Array operator()(Array const &lhs, T const &scalar) const { Array result; - minimum scalar_op; + minimum scalar_op; CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { @@ -986,7 +907,7 @@ struct minimum, true> { Array operator()(T const &scalar, Array const &rhs) const { Array result; - minimum scalar_op; + minimum scalar_op; CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { @@ -2030,8 +1951,8 @@ struct multiply_add_relu0, Array, Array> } }; -template -struct minimum, false> { +template +struct minimum, PropagateNaN> { CUTLASS_HOST_DEVICE Array operator()(Array const & lhs, Array const &rhs) const { Array result; @@ -2043,25 +1964,27 @@ struct minimum, false> { CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N / 2; ++i) { - result_ptr[i] = __hmin2(lhs_ptr[i], rhs_ptr[i]); + result_ptr[i] = PropagateNaN ? __hmin2_nan(lhs_ptr[i], rhs_ptr[i]) + : __hmin2(lhs_ptr[i], rhs_ptr[i]); } if constexpr (N % 2) { __half const *a_residual_ptr = reinterpret_cast<__half const *>(&lhs); __half const *b_residual_ptr = reinterpret_cast<__half const *>(&rhs); - __half d_residual = __hmin( - a_residual_ptr[N - 1], - b_residual_ptr[N - 1]); + __half d_residual = PropagateNaN ? __hmin_nan(a_residual_ptr[N - 1], b_residual_ptr[N - 1]) + : __hmin(a_residual_ptr[N - 1], b_residual_ptr[N - 1]); result[N - 1] = reinterpret_cast(d_residual); } #else + minimum mn; + CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { - result[i] = (rhs[i] < lhs[i] ? rhs[i] : lhs[i]); + result[i] = mn(lhs[i],rhs[i]); } #endif @@ -2079,24 +2002,26 @@ struct minimum, false> { CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N / 2; ++i) { - result_ptr[i] = __hmin2(lhs_pair, rhs_ptr[i]); + result_ptr[i] = PropagateNaN ? __hmin2_nan(lhs_pair, rhs_ptr[i]) + : __hmin2(lhs_pair, rhs_ptr[i]); } if constexpr (N % 2) { __half const *b_residual_ptr = reinterpret_cast<__half const *>(&rhs); - __half d_residual = __hmin( - reinterpret_cast<__half const &>(lhs), - b_residual_ptr[N - 1]); + __half d_residual = PropagateNaN ? __hmin_nan(reinterpret_cast<__half const &>(lhs), b_residual_ptr[N - 1]) + : __hmin(reinterpret_cast<__half const &>(lhs), b_residual_ptr[N - 1]); result[N - 1] = reinterpret_cast(d_residual); } #else + minimum mn; + CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { - result[i] = (rhs[i] < lhs ? 
rhs[i] : lhs); + result[i] = mn(lhs, rhs[i]); } #endif @@ -2114,24 +2039,26 @@ CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N / 2; ++i) { - result_ptr[i] = __hmin2(lhs_ptr[i], rhs_pair); + result_ptr[i] = PropagateNaN ? __hmin2_nan(lhs_ptr[i], rhs_pair) + : __hmin2(lhs_ptr[i], rhs_pair); } if constexpr (N % 2) { __half const *a_residual_ptr = reinterpret_cast<__half const *>(&lhs); - __half d_residual = __hmin( - a_residual_ptr[N - 1], - reinterpret_cast<__half const &>(rhs)); + __half d_residual = PropagateNaN ? __hmin_nan(a_residual_ptr[N - 1], reinterpret_cast<__half const &>(rhs)) + : __hmin(a_residual_ptr[N - 1], reinterpret_cast<__half const &>(rhs)); result[N - 1] = reinterpret_cast(d_residual); } #else + minimum mn; + CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { - result[i] = (rhs < lhs[i] ? rhs : lhs[i]); + result[i] = mn(lhs[i], rhs); } #endif @@ -2139,8 +2066,8 @@ } }; -template -struct maximum, false> { +template +struct maximum, PropagateNaN> { CUTLASS_HOST_DEVICE Array operator()(Array const & lhs, Array const &rhs) const { Array result; @@ -2152,25 +2079,27 @@ CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N / 2; ++i) { - result_ptr[i] = __hmax2(lhs_ptr[i], rhs_ptr[i]); + result_ptr[i] = PropagateNaN ? __hmax2_nan(lhs_ptr[i], rhs_ptr[i]) + : __hmax2(lhs_ptr[i], rhs_ptr[i]); } if constexpr (N % 2) { __half const *a_residual_ptr = reinterpret_cast<__half const *>(&lhs); __half const *b_residual_ptr = reinterpret_cast<__half const *>(&rhs); - __half d_residual = __hmax( - a_residual_ptr[N - 1], - b_residual_ptr[N - 1]); + __half d_residual = PropagateNaN ? __hmax_nan(a_residual_ptr[N - 1], b_residual_ptr[N - 1]) + : __hmax(a_residual_ptr[N - 1], b_residual_ptr[N - 1]); result[N - 1] = reinterpret_cast(d_residual); } #else + maximum mx; + CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { - result[i] = (lhs[i] < rhs[i] ? rhs[i] : lhs[i]); + result[i] = mx(lhs[i], rhs[i]); } #endif @@ -2188,24 +2117,26 @@ CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N / 2; ++i) { - result_ptr[i] = __hmax2(lhs_pair, rhs_ptr[i]); + result_ptr[i] = PropagateNaN ? __hmax2_nan(lhs_pair, rhs_ptr[i]) + : __hmax2(lhs_pair, rhs_ptr[i]); } if constexpr (N % 2) { __half const *b_residual_ptr = reinterpret_cast<__half const *>(&rhs); - __half d_residual = __hmax( - reinterpret_cast<__half const &>(lhs), - b_residual_ptr[N - 1]); + __half d_residual = PropagateNaN ? __hmax_nan(reinterpret_cast<__half const &>(lhs), b_residual_ptr[N - 1]) + : __hmax(reinterpret_cast<__half const &>(lhs), b_residual_ptr[N - 1]); result[N - 1] = reinterpret_cast(d_residual); } #else + maximum mx; + CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { - result[i] = (lhs < rhs[i] ? rhs[i] : lhs); + result[i] = mx(lhs, rhs[i]); } #endif @@ -2223,24 +2154,26 @@ CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N / 2; ++i) { - result_ptr[i] = __hmax2(lhs_ptr[i], rhs_pair); + result_ptr[i] = PropagateNaN ? __hmax2_nan(lhs_ptr[i], rhs_pair) + : __hmax2(lhs_ptr[i], rhs_pair); } if constexpr (N % 2) { __half const *a_residual_ptr = reinterpret_cast<__half const *>(&lhs); - __half d_residual = __hmax( - a_residual_ptr[N - 1], - reinterpret_cast<__half const &>(rhs)); + __half d_residual = PropagateNaN ? 
__hmax_nan(a_residual_ptr[N - 1], reinterpret_cast<__half const &>(rhs)) + : __hmax(a_residual_ptr[N - 1], reinterpret_cast<__half const &>(rhs)); result[N - 1] = reinterpret_cast(d_residual); } #else + maximum mx; + CUTLASS_PRAGMA_UNROLL for (int i = 0; i < N; ++i) { - result[i] = (lhs[i] < rhs ? rhs : lhs[i]); + result[i] = mx(lhs[i], rhs); } #endif diff --git a/include/cutlass/barrier.h b/include/cutlass/barrier.h index 1cfc73c1cb..cb6218880c 100644 --- a/include/cutlass/barrier.h +++ b/include/cutlass/barrier.h @@ -281,7 +281,7 @@ struct NamedBarrierManager { CUTLASS_DEVICE static void check_barrier_in_range([[maybe_unused]] uint32_t idx) { - assert((idx >= MaxNumNamedBarriers) && "Index exceeds barrier count"); + assert((idx < MaxNumNamedBarriers) && "Index exceeds barrier count"); } template diff --git a/include/cutlass/bfloat16.h b/include/cutlass/bfloat16.h index 30b9c8403b..87b1f2f42f 100644 --- a/include/cutlass/bfloat16.h +++ b/include/cutlass/bfloat16.h @@ -199,6 +199,14 @@ struct alignas(2) bfloat16_t { return (float(*this) != 0.0f); } +#if !defined(CUTLASS_ENABLE_SYCL) + /// Bitcasts to CUDA's bf16 type + CUTLASS_DEVICE + __nv_bfloat16 to_nv_bfloat16() const { + return reinterpret_cast<__nv_bfloat16 const &>(storage); + } +#endif + /// Obtains raw bits CUTLASS_HOST_DEVICE uint16_t raw() const { @@ -330,9 +338,9 @@ bfloat16_t copysign(bfloat16_t const& a, bfloat16_t const& b) { // /////////////////////////////////////////////////////////////////////////////////////////////////// +#if !defined(__CUDACC_RTC__) namespace std { -#if !defined(__CUDACC_RTC__) /// Numeric limits template <> struct numeric_limits { @@ -387,9 +395,78 @@ struct numeric_limits { CUTLASS_HOST_DEVICE static cutlass::bfloat16_t denorm_min() { return cutlass::bfloat16_t::bitcast(0x1); } }; -#endif } // namespace std +#endif + +namespace cutlass { +namespace platform { + +/// Forward Declaration +template +struct numeric_limits; + +/// Numeric limits +template <> +struct numeric_limits { + static bool const is_specialized = true; + static bool const is_signed = true; + static bool const is_integer = false; + static bool const is_exact = false; + static bool const has_infinity = true; + static bool const has_quiet_NaN = true; + static bool const has_signaling_NaN = false; +#if !defined(__CUDACC_RTC__) + static std::float_denorm_style const has_denorm = std::denorm_present; +#endif + static bool const has_denorm_loss = true; +#if !defined(__CUDACC_RTC__) + static std::float_round_style const round_style = std::round_to_nearest; +#endif + static bool const is_iec559 = false; + static bool const is_bounded = true; + static bool const is_modulo = false; + static int const digits = 7; + + /// Least positive value + CUTLASS_HOST_DEVICE + static cutlass::bfloat16_t min() { return cutlass::bfloat16_t::bitcast(0x01); } + + /// Minimum finite value + CUTLASS_HOST_DEVICE + static cutlass::bfloat16_t lowest() { return cutlass::bfloat16_t::bitcast(0xff7f); } + + /// Maximum finite value + CUTLASS_HOST_DEVICE + static cutlass::bfloat16_t max() { return cutlass::bfloat16_t::bitcast(0x7f7f); } + + /// Returns smallest finite value + CUTLASS_HOST_DEVICE + static cutlass::bfloat16_t epsilon() { return cutlass::bfloat16_t::bitcast(0x1000); } + + /// Returns smallest finite value + CUTLASS_HOST_DEVICE + static cutlass::bfloat16_t round_error() { return cutlass::bfloat16_t(0.5f); } + + /// Returns smallest finite value + CUTLASS_HOST_DEVICE + static cutlass::bfloat16_t infinity() { return cutlass::bfloat16_t::bitcast(0x7f80); } + 
+ /// Returns smallest finite value + CUTLASS_HOST_DEVICE + static cutlass::bfloat16_t quiet_NaN() { return cutlass::bfloat16_t::bitcast(0x7fff); } + + /// Returns smallest finite value + CUTLASS_HOST_DEVICE + static cutlass::bfloat16_t signaling_NaN() { return cutlass::bfloat16_t::bitcast(0x7fff); } + + /// Returns smallest finite value + CUTLASS_HOST_DEVICE + static cutlass::bfloat16_t denorm_min() { return cutlass::bfloat16_t::bitcast(0x1); } +}; + +} // namespace platform +} // namespace cutlass /////////////////////////////////////////////////////////////////////////////////////////////////// // @@ -403,114 +480,190 @@ namespace cutlass { CUTLASS_HOST_DEVICE bool operator==(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return __heq(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16()); +#else return float(lhs) == float(rhs); +#endif } CUTLASS_HOST_DEVICE bool operator!=(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return __hne(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16()); +#else return float(lhs) != float(rhs); +#endif } CUTLASS_HOST_DEVICE bool operator<(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return __hlt(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16()); +#else return float(lhs) < float(rhs); +#endif } CUTLASS_HOST_DEVICE bool operator<=(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return __hle(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16()); +#else return float(lhs) <= float(rhs); +#endif } CUTLASS_HOST_DEVICE bool operator>(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return __hgt(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16()); +#else return float(lhs) > float(rhs); +#endif } CUTLASS_HOST_DEVICE bool operator>=(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return __hge(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16()); +#else return float(lhs) >= float(rhs); +#endif } CUTLASS_HOST_DEVICE bfloat16_t operator+(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return bfloat16_t(__hadd(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16())); +#else return bfloat16_t(float(lhs) + float(rhs)); +#endif } CUTLASS_HOST_DEVICE bfloat16_t operator-(bfloat16_t const& lhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return bfloat16_t(__hneg(lhs.to_nv_bfloat16())); +#else return bfloat16_t(-float(lhs)); +#endif } CUTLASS_HOST_DEVICE bfloat16_t operator-(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return bfloat16_t(__hsub(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16())); +#else return bfloat16_t(float(lhs) - float(rhs)); +#endif } CUTLASS_HOST_DEVICE bfloat16_t operator*(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return bfloat16_t(__hmul(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16())); +#else return bfloat16_t(float(lhs) * float(rhs)); +#endif } CUTLASS_HOST_DEVICE bfloat16_t operator/(bfloat16_t const& lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + return bfloat16_t(__hdiv(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16())); +#else return bfloat16_t(float(lhs) / float(rhs)); +#endif } CUTLASS_HOST_DEVICE bfloat16_t& operator+=(bfloat16_t & lhs, bfloat16_t const& rhs) { +#if 
defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + lhs = bfloat16_t(__hadd(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16())); +#else lhs = bfloat16_t(float(lhs) + float(rhs)); +#endif return lhs; } CUTLASS_HOST_DEVICE bfloat16_t& operator-=(bfloat16_t & lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + lhs = bfloat16_t(__hsub(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16())); +#else lhs = bfloat16_t(float(lhs) - float(rhs)); +#endif return lhs; } CUTLASS_HOST_DEVICE bfloat16_t& operator*=(bfloat16_t & lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + lhs = bfloat16_t(__hmul(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16())); +#else lhs = bfloat16_t(float(lhs) * float(rhs)); +#endif return lhs; } CUTLASS_HOST_DEVICE bfloat16_t& operator/=(bfloat16_t & lhs, bfloat16_t const& rhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + lhs = bfloat16_t(__hdiv(lhs.to_nv_bfloat16(), rhs.to_nv_bfloat16())); +#else lhs = bfloat16_t(float(lhs) / float(rhs)); +#endif return lhs; } CUTLASS_HOST_DEVICE bfloat16_t& operator++(bfloat16_t & lhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + lhs = bfloat16_t(__hadd(lhs.to_nv_bfloat16(), bfloat16_t(1.0f).to_nv_bfloat16())); +#else float tmp(lhs); ++tmp; lhs = bfloat16_t(tmp); +#endif return lhs; } CUTLASS_HOST_DEVICE bfloat16_t& operator--(bfloat16_t & lhs) { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + lhs = bfloat16_t(__hsub(lhs.to_nv_bfloat16(), bfloat16_t(1.0f).to_nv_bfloat16())); +#else float tmp(lhs); --tmp; lhs = bfloat16_t(tmp); +#endif return lhs; } CUTLASS_HOST_DEVICE bfloat16_t operator++(bfloat16_t & lhs, int) { bfloat16_t ret(lhs); +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + lhs = bfloat16_t(__hadd(lhs.to_nv_bfloat16(), bfloat16_t(1.0f).to_nv_bfloat16())); +#else float tmp(lhs); tmp++; lhs = bfloat16_t(tmp); +#endif return ret; } CUTLASS_HOST_DEVICE bfloat16_t operator--(bfloat16_t & lhs, int) { bfloat16_t ret(lhs); +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + lhs = bfloat16_t(__hsub(lhs.to_nv_bfloat16(), bfloat16_t(1.0f).to_nv_bfloat16())); +#else float tmp(lhs); tmp--; lhs = bfloat16_t(tmp); +#endif return ret; } @@ -525,12 +678,12 @@ bfloat16_t operator--(bfloat16_t & lhs, int) { // CUTLASS_HOST_DEVICE -cutlass::bfloat16_t operator "" _bf16(long double x) { +cutlass::bfloat16_t operator ""_bf16(long double x) { return cutlass::bfloat16_t(float(x)); } CUTLASS_HOST_DEVICE -cutlass::bfloat16_t operator "" _bf16(unsigned long long int x) { +cutlass::bfloat16_t operator ""_bf16(unsigned long long int x) { return cutlass::bfloat16_t(int(x)); } diff --git a/include/cutlass/cluster_launch.hpp b/include/cutlass/cluster_launch.hpp index 07a5c80b6a..88b141c51f 100644 --- a/include/cutlass/cluster_launch.hpp +++ b/include/cutlass/cluster_launch.hpp @@ -174,6 +174,7 @@ struct ClusterLauncher { "And ClusterDims = " "(" << cluster_dims.x << ", " << cluster_dims.y << ", " << cluster_dims.z << ")\n"); + cutlass::arch::synclog_setup(); cudaError_t status = cudaLaunchKernelExC(&launch_config, kernel, kernel_params); Return_Status(status); #else diff --git a/include/cutlass/conv/collective/sm90_implicit_gemm_gmma_ss_warpspecialized.hpp b/include/cutlass/conv/collective/sm90_implicit_gemm_gmma_ss_warpspecialized.hpp index 13bb7c515c..78862b0a09 100644 --- a/include/cutlass/conv/collective/sm90_implicit_gemm_gmma_ss_warpspecialized.hpp +++ b/include/cutlass/conv/collective/sm90_implicit_gemm_gmma_ss_warpspecialized.hpp @@ -41,8 +41,8 @@ #include 
"cute/algorithm/functional.hpp" #include "cute/algorithm/gemm.hpp" +#include "cutlass/conv/detail.hpp" #include "cutlass/conv/convolution.h" -#include "cutlass/conv/convnd_problem_shape.hpp" #include "cutlass/conv/dispatch_policy.hpp" #include "cutlass/pipeline/pipeline.hpp" #include "cutlass/util/packed_stride.hpp" @@ -103,6 +103,8 @@ struct CollectiveConv< using PipelineParams = typename MainloopPipeline::Params; using PipelineState = typename cutlass::PipelineState; + + using ProblemShape = ConvProblemShape; // TODO: move pipeline mode tiling into the collective setup phase instead static_assert(rank(SmemLayoutA{}) == 3, "SmemLayout must be rank 3 (M/N, K, PIPE)"); @@ -143,7 +145,7 @@ struct CollectiveConv< struct SharedStorage { - struct TensorStorage : cute::aligned_struct<128> { + struct TensorStorage : cute::aligned_struct<128, _0> { cute::array_aligned> smem_A; cute::array_aligned> smem_B; } tensors; @@ -162,8 +164,6 @@ struct CollectiveConv< // Host side kernel arguments struct Arguments { - using ProblemShape = ConvProblemShape; - ProblemShape problem_shape{}; ElementA const* ptr_A{nullptr}; ElementB const* ptr_B{nullptr}; }; @@ -175,7 +175,7 @@ struct CollectiveConv< // Get tma_load_a instantce. template static constexpr auto - get_tma_load_a_instance(TensorA const& tensor_a, typename Arguments::ProblemShape const& problem_shape) { + get_tma_load_a_instance(TensorA const& tensor_a, ProblemShape const& problem_shape) { if constexpr (is_im2col_A) { // compute the upper and lower corners based on the conv padding auto lower_corner_whd = detail::compute_lower_corner_whd(problem_shape); @@ -218,7 +218,7 @@ struct CollectiveConv< // Get tma_load_b instantce. template static constexpr auto - get_tma_load_b_instance(TensorB const& tensor_b, typename Arguments::ProblemShape const& problem_shape) { + get_tma_load_b_instance(TensorB const& tensor_b, ProblemShape const& problem_shape) { // TMA im2col mode for tensor B in wgrad kernel. if constexpr (is_im2col_B) { // compute the upper and lower corners based on the conv padding @@ -250,24 +250,25 @@ struct CollectiveConv< } } +public: + + // Performs im2col transformations on the input of type ConvProblemShape static constexpr auto - get_problem_shape_MNKL(typename Arguments::ProblemShape const& problem_shape) { + get_problem_shape_MNKL(ProblemShape const& problem_shape) { + if constexpr (is_im2col_A || is_im2col_B) { // transformation + im2col linearization - return problem_shape.get_linearized_problem_shape_MNKL(); + return cutlass::conv::detail::get_linearized_problem_shape_MNKL(problem_shape); } else { // transformation - return problem_shape.get_transformed_problem_shape_MNKL(); + return cutlass::conv::detail::get_transformed_problem_shape_MNKL(problem_shape); } } -public: - // Device side kernel params struct Params { - using _Submode = decltype(take<0,NumTensorDimensions-1>(typename Arguments::ProblemShape::TensorExtent{})); - using ProblemShape = decltype(get_problem_shape_MNKL(typename Arguments::ProblemShape{})); + using _Submode = decltype(take<0,NumTensorDimensions-1>(typename ProblemShape::TensorExtent{})); // Assumption: StrideA is congruent with Problem_MK // Select TMA load type according to convolution operator. 
@@ -294,7 +295,6 @@ struct CollectiveConv< // Members TMA_A tma_load_a; TMA_B tma_load_b; - ProblemShape problem_shape; uint32_t tma_transaction_bytes = TmaTransactionBytes; }; @@ -304,19 +304,19 @@ struct CollectiveConv< // Lowers the host side user facing arguments to the kernel facing launch params static constexpr Params - to_underlying_arguments(Arguments const& args, void* workspace) { + to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { (void) workspace; // from the flat problem shape arrays of ConvProblemShape, create a rank-3 MNK problem shape tuple // tma desc creation depends on the original untransformed domain. // A extents. - auto shape_A_orig = args.problem_shape.get_shape_A(); + auto shape_A_orig = problem_shape.get_shape_A(); // B extents. - auto shape_B_orig = args.problem_shape.get_shape_B(); + auto shape_B_orig = problem_shape.get_shape_B(); // Fill inferred cute strides from flat stride arrays - auto dA = make_cute_packed_stride(StrideA{}, args.problem_shape.stride_A, ConvOp); - auto dB = make_cute_packed_stride(StrideB{}, args.problem_shape.stride_B, ConvOp); + auto dA = make_cute_packed_stride(StrideA{}, problem_shape.stride_A, ConvOp); + auto dB = make_cute_packed_stride(StrideB{}, problem_shape.stride_B, ConvOp); auto ptr_A = reinterpret_cast(args.ptr_A); auto ptr_B = reinterpret_cast(args.ptr_B); @@ -324,20 +324,17 @@ struct CollectiveConv< Tensor tensor_a = make_tensor(make_gmem_ptr(ptr_A), make_layout(shape_A_orig, dA)); Tensor tensor_b = make_tensor(make_gmem_ptr(ptr_B), make_layout(shape_B_orig, dB)); - auto tma_load_a = get_tma_load_a_instance(tensor_a, args.problem_shape); - auto tma_load_b = get_tma_load_b_instance(tensor_b, args.problem_shape); - - auto problem_shape_mnkl = get_problem_shape_MNKL(args.problem_shape); + auto tma_load_a = get_tma_load_a_instance(tensor_a, problem_shape); + auto tma_load_b = get_tma_load_b_instance(tensor_b, problem_shape); return { tma_load_a, tma_load_b, - problem_shape_mnkl, TmaTransactionBytes }; } - - template + + template static bool can_implement( ProblemShape const& problem_shape, @@ -345,14 +342,14 @@ struct CollectiveConv< // Activation and Filter channel mode extents must match bool implementable = true; // channel mode is major - implementable &= args.problem_shape.stride_A[NumTensorDimensions-1] == 1; - implementable &= args.problem_shape.stride_B[NumTensorDimensions-1] == 1; + implementable &= problem_shape.stride_A[NumTensorDimensions-1] == 1; + implementable &= problem_shape.stride_B[NumTensorDimensions-1] == 1; constexpr int tma_alignment_bits = 128; // A extents. - auto shape_A_orig = args.problem_shape.get_shape_A(); + auto shape_A_orig = problem_shape.get_shape_A(); // B extents. - auto shape_B_orig = args.problem_shape.get_shape_B(); + auto shape_B_orig = problem_shape.get_shape_B(); constexpr int min_tma_aligned_elements_A = tma_alignment_bits / cutlass::sizeof_bits::value; implementable = implementable && cutlass::detail::check_alignment(shape_A_orig, StrideA{}); constexpr int min_tma_aligned_elements_B = tma_alignment_bits / cutlass::sizeof_bits::value; @@ -390,24 +387,53 @@ struct CollectiveConv< cute::prefetch_tma_descriptor(mainloop_params.tma_load_b.get_tma_descriptor()); } + /// Set up the data needed by this collective for load and mma. + /// Returns a tuple of tensors. 
The collective and the kernel layer have the contract that the + /// returned tuple must contain at least two elements, with the first two elements being: + /// gA_mk - The tma tensor A after a local tile, so it has shape (BLK_M,BLK_K,m,k) + /// gB_nk - The tma tensor B after a local tile, so it has shape (BLK_N,BLK_K,n,k) + /// The rest of the tensors can be specified as needed by this collective. + /// The dimensions of gA_mk and gB_nk do not contain L to maintain consistency with + /// StrideA and StrideB set up for TMA + template <class ProblemShapeMNKL> + CUTLASS_DEVICE auto + load_init(ProblemShapeMNKL const& problem_shape_MNKL, Params const& mainloop_params) { + using X = Underscore; + // Separate out problem shape for convenience + auto [M, N, K, L] = problem_shape_MNKL; + + // TMA requires special handling of strides to deal with coord codomain mapping + // Represent the full tensors -- get these from TMA + Tensor mA_mk = mainloop_params.tma_load_a.get_tma_tensor(make_shape(M,K)); // (m,k) + Tensor mB_nk = mainloop_params.tma_load_b.get_tma_tensor(make_shape(N,K)); // (n,k) + + // Make tiled views, defer the slice + Tensor gA_mk = local_tile(mA_mk, TileShape{}, make_coord(_,_,_), Step<_1, X,_1>{}); // (BLK_M,BLK_K,m,k) + Tensor gB_nk = local_tile(mB_nk, TileShape{}, make_coord(_,_,_), Step< X,_1,_1>{}); // (BLK_N,BLK_K,n,k) + + return cute::make_tuple(gA_mk, gB_nk); + } + /// Perform a collective-scoped matrix multiply-accumulate /// Producer Perspective template < - class TensorA, class TMA_LOAD_A, - class TensorB, class TMA_LOAD_B, - class KTileIterator + class TensorA, class TensorB, + class KTileIterator, class BlockCoord > CUTLASS_DEVICE void - load(MainloopPipeline pipeline, - PipelineState smem_pipe_producer_state, - TensorA const& gA, TMA_LOAD_A& tma_load_a, - TensorB const& gB, TMA_LOAD_B& tma_load_b, - KTileIterator k_tile_iter, int k_tile_count, - int thread_idx, - uint32_t block_rank_in_cluster, - TensorStorage& shared_tensors) { - int lane_predicate = cute::elect_one_sync(); + load( + Params const& mainloop_params, + MainloopPipeline pipeline, + PipelineState smem_pipe_producer_state, + cute::tuple<TensorA, TensorB> const& load_inputs, + BlockCoord const& blk_coord, + KTileIterator k_tile_iter, int k_tile_count, + int thread_idx, + uint32_t block_rank_in_cluster, + TensorStorage& shared_tensors) { + int lane_predicate = cute::elect_one_sync(); if (lane_predicate) { Tensor sA = make_tensor(make_smem_ptr(shared_tensors.smem_A.data()), SmemLayoutA{}); // (BLK_M,BLK_K,PIPE) Tensor sB = make_tensor(make_smem_ptr(shared_tensors.smem_B.data()), SmemLayoutB{}); // (BLK_N,BLK_K,PIPE) @@ -415,11 +441,19 @@ struct CollectiveConv< // // Prepare the TMA loads for A and B // - constexpr uint32_t cluster_shape_x = get<0>(ClusterShape()); + uint2 cluster_local_block_id = {block_rank_in_cluster % cluster_shape_x, block_rank_in_cluster / cluster_shape_x}; - auto block_tma_a = tma_load_a.get_slice(cluster_local_block_id.y); - auto block_tma_b = tma_load_b.get_slice(cluster_local_block_id.x); + auto block_tma_a = mainloop_params.tma_load_a.get_slice(cluster_local_block_id.y); + auto block_tma_b = mainloop_params.tma_load_b.get_slice(cluster_local_block_id.x); + + auto [gA_mk, gB_nk] = load_inputs; + + // Partition the inputs based on the current block coordinates. 
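// A standalone sketch (plain C++, not CuTe) of what "make tiled views, defer the
// slice" buys at this exact step: load_init built gA_mk with shape
// (BLK_M,BLK_K,m,k), and the slice gA_mk(_,_,m_coord,_) keeps one M tile and
// every K tile. All extents and the block coordinate below are hypothetical.
#include <cstdio>

int main() {
  const int M = 512, K = 256;          // hypothetical problem extents
  const int BLK_M = 128, BLK_K = 64;   // hypothetical tile shape (BLK_M,BLK_K)
  const int m_tiles = M / BLK_M;       // mode-2 extent of gA_mk, here 4
  const int k_tiles = K / BLK_K;       // mode-3 extent of gA_mk, here 4
  const int m_coord = 2;               // this CTA's M tile, e.g. from blockIdx.x

  std::printf("gA_mk has modes (BLK_M=%d, BLK_K=%d, m=%d, k=%d)\n",
              BLK_M, BLK_K, m_tiles, k_tiles);
  // Equivalent of slicing gA_mk(_,_,m_coord,_): visit the K tiles of one M tile.
  for (int k = 0; k < k_tiles; ++k) {
    std::printf("tile (m=%d,k=%d) covers rows [%d,%d), cols [%d,%d)\n",
                m_coord, k, m_coord * BLK_M, (m_coord + 1) * BLK_M,
                k * BLK_K, (k + 1) * BLK_K);
  }
  return 0;
}
// In the kernel proper, the same selection is written as gA_mk(_,_,m_coord,_).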
+ auto [m_coord, n_coord, k_coord, l_coord] = blk_coord; + + Tensor gA = gA_mk(_,_,m_coord,_); // (BLK_M,BLK_K,k) + Tensor gB = gB_nk(_,_,n_coord,_); // (BLK_N,BLK_K,k) // Applies the mapping from block_tma_a Tensor tAgA = block_tma_a.partition_S(gA); // (TMA,TMA_M,TMA_K,k) @@ -463,8 +497,9 @@ struct CollectiveConv< BarrierType* tma_barrier = pipeline.producer_get_barrier(smem_pipe_producer_state); int write_stage = smem_pipe_producer_state.index(); - copy(tma_load_a.with(*tma_barrier, mcast_mask_a), tAgA(_,_,_,*k_tile_iter), tAsA(_,_,_,write_stage)); - copy(tma_load_b.with(*tma_barrier, mcast_mask_b), tBgB(_,_,_,*k_tile_iter), tBsB(_,_,_,write_stage)); + + copy(mainloop_params.tma_load_a.with(*tma_barrier, mcast_mask_a), tAgA(_,_,_,*k_tile_iter), tAsA(_,_,_,write_stage)); + copy(mainloop_params.tma_load_b.with(*tma_barrier, mcast_mask_b), tBgB(_,_,_,*k_tile_iter), tBsB(_,_,_,write_stage)); ++k_tile_iter; // Advance smem_pipe_producer_state diff --git a/include/cutlass/conv/convnd_problem_shape.hpp b/include/cutlass/conv/convnd_problem_shape.hpp index 0172120538..ffcc547fbd 100644 --- a/include/cutlass/conv/convnd_problem_shape.hpp +++ b/include/cutlass/conv/convnd_problem_shape.hpp @@ -43,6 +43,7 @@ #include #endif + //////////////////////////////////////////////////////////////////////////////////////////////////// namespace cutlass::conv { @@ -54,15 +55,17 @@ namespace cutlass::conv { // Supports asymmetric padding, traversal strides, dilations, and all conv algorithm types. template < conv::Operator ConvOp_, - int NumSpatialDimensions + int NumSpatialDimensions_ > struct ConvProblemShape { // // Alias types for members // - static constexpr int RankS = NumSpatialDimensions; - static constexpr int RankT = NumSpatialDimensions + 2; + + static constexpr int RankS = NumSpatialDimensions_; + static constexpr int RankT = NumSpatialDimensions_ + 2; static constexpr conv::Operator ConvOp = ConvOp_; + static constexpr int NumSpatialDimensions = NumSpatialDimensions_; using SpatialExtent = cute::array; using TensorExtent = cute::array; using TensorStride = cute::array; @@ -352,71 +355,6 @@ struct ConvProblemShape { } } - // Get problem shape MNKL according to following table: - // | | Fprop | Dgrad | Wgrad | - // | ---- | --------- | -------- | -------- | - // | Shape_M | (Q,P,Z,N) | (W/V,H/U,D/O,N) | (K) | - // | Shape_N | (K) | (C) | (C,S,R,T) | - // | Shape_K | (C,S,R,T) | (K,S,R,T) | (Q,P,Z,N) | - // | Shape_L | _1 | (V,U,O) | _1 | - CUTLASS_HOST_DEVICE - constexpr auto - get_transformed_problem_shape_MNKL() const { - using cute::insert; - using cute::make_shape; - using cute::reverse; - using cute::take; - - if constexpr (ConvOp == conv::Operator::kWgrad) { - auto M_xformed = shape_C[0]; - auto N_xformed = reverse(take<1, RankT>(shape_C)); - auto K_xformed = reverse(take<0, RankT - 1>(shape_A)); - auto L_xformed = cute::Int<1>{}; - - return make_shape(M_xformed, N_xformed, K_xformed, L_xformed); - } - else if constexpr (ConvOp == conv::Operator::kFprop){ - auto M_xformed = reverse(take<0, RankT - 1>(shape_C)); - auto N_xformed = shape_C[RankT - 1]; - auto K_xformed = reverse(take<1, RankT>(shape_B)); - auto L_xformed = cute::Int<1>{}; - - return make_shape(M_xformed, N_xformed, K_xformed, L_xformed); - } - else if constexpr (ConvOp == conv::Operator::kDgrad) { - auto L_xformed = reverse(traversal_stride); // (V,U,O) - auto M_xformed = ceil_div(reverse(take<0,RankT - 1>(shape_C)), L_xformed); - auto N_xformed = shape_C[RankT - 1]; - // shape_B: [K,T,R,S,C], K_xformed: [K,S,R,T] - auto 
K_xformed = insert<0>( - (reverse(take<1,RankT - 1>(shape_B))), - shape_B[0]); - - return make_shape(M_xformed, N_xformed, K_xformed, L_xformed); - } - } - - // Assuming im2col linearization - // Get problem shape MNKL according to following table: - // | | Fprop | Dgrad | Wgrad | - // | ---- | --------- | -------- | -------- | - // | Shape_M | (Q*P*Z*N) | ([W/V]*[H/U]*[D/O]*N) | (K) | - // | Shape_N | (K) | (C) | (C,S,R,T) | - // | Shape_K | (C,S,R,T) | (K,S,R,T) | (Q*P*Z*N) | - // | Shape_L | _1 | (V*U*O) | _1 | - CUTLASS_HOST_DEVICE - constexpr auto - get_linearized_problem_shape_MNKL() const { - auto [M, N, K, L] = get_transformed_problem_shape_MNKL(); - - if constexpr (ConvOp == conv::Operator::kFprop || ConvOp == conv::Operator::kDgrad) { - return cute::make_shape(cute::product(M), N, K, cute::product(L)); - } - else if constexpr (ConvOp == conv::Operator::kWgrad) { - return cute::make_shape(M, N, cute::product(K), L); - } - } - // Get A extents. // fprop: A extents array contains [N,D,H,W,C]. Turn that into ((W,H,D,N), (C)) // dgrad: A extents array contains [N,Z,P,Q,K]. Turn that into ((Q,P,Z,N), (K)) @@ -578,9 +516,7 @@ struct ConvProblemShape { // calculate n,z,p,q,k. // a helper lambda to compute a single spatial extent of the nzpqk tensor auto nzpqk_extent = [](int act_ext, int filter_ext, int pad_total, int dilation, int tstride) { - auto tmp = act_ext + pad_total - ((filter_ext -1) * dilation + 1); - CUTLASS_ASSERT(tmp % tstride == 0); - return 1 + tmp / tstride; + return 1 + (act_ext + pad_total - ((filter_ext -1) * dilation + 1)) / tstride; }; shape_xformed_act[0] = shape_act[0]; // Activation N extent diff --git a/include/cutlass/conv/detail.hpp b/include/cutlass/conv/detail.hpp new file mode 100644 index 0000000000..3e4173569c --- /dev/null +++ b/include/cutlass/conv/detail.hpp @@ -0,0 +1,137 @@ + +/*************************************************************************************************** + * Copyright (c) 2023 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +#pragma once + +#include "cutlass/conv/convnd_problem_shape.hpp" + +///////////////////////////////////////////////////////////////////////////////////////////////// + +namespace cutlass::conv::detail { + +///////////////////////////////////////////////////////////////////////////////////////////////// + + // Helper function to get the problem shape +template +auto get_problem_shape_MNKL_helper(ProblemShape const& problem_shape, cute::true_type) { + return T::get_problem_shape_MNKL(problem_shape); +} + +template +ProblemShape get_problem_shape_MNKL_helper(ProblemShape const& problem_shape, cute::false_type) { + return problem_shape; +} + +// Get problem shape MNKL according to following table: +// | | Fprop | Dgrad | Wgrad | +// | ---- | --------- | -------- | -------- | +// | Shape_M | (Q,P,Z,N) | (W/V,H/U,D/O,N) | (K) | +// | Shape_N | (K) | (C) | (C,S,R,T) | +// | Shape_K | (C,S,R,T) | (K,S,R,T) | (Q,P,Z,N) | +// | Shape_L | _1 | (V,U,O) | _1 | + +template +CUTLASS_HOST_DEVICE +constexpr auto +get_transformed_problem_shape_MNKL(ProblemShape const& problem_shape) { + return problem_shape; +} + + +template +CUTLASS_HOST_DEVICE +constexpr auto +get_transformed_problem_shape_MNKL(ConvProblemShape const& problem_shape) { + using cute::insert; + using cute::make_shape; + using cute::reverse; + using cute::take; + + constexpr int RankT = SpatialDim + 2; + + if constexpr (ConvOp == conv::Operator::kWgrad) { + auto M_xformed = problem_shape.shape_C[0]; + auto N_xformed = reverse(take<1, RankT>(problem_shape.shape_C)); + auto K_xformed = reverse(take<0, RankT - 1>(problem_shape.shape_A)); + auto L_xformed = cute::Int<1>{}; + + return make_shape(M_xformed, N_xformed, K_xformed, L_xformed); + } + else if constexpr (ConvOp == conv::Operator::kFprop){ + auto M_xformed = reverse(take<0, RankT - 1>(problem_shape.shape_C)); + auto N_xformed = problem_shape.shape_C[RankT - 1]; + auto K_xformed = reverse(take<1, RankT>(problem_shape.shape_B)); + auto L_xformed = cute::Int<1>{}; + + return make_shape(M_xformed, N_xformed, K_xformed, L_xformed); + } + else if constexpr (ConvOp == conv::Operator::kDgrad) { + auto L_xformed = reverse(problem_shape.traversal_stride); // (V,U,O) + auto M_xformed = ceil_div(reverse(take<0,RankT - 1>(problem_shape.shape_C)), L_xformed); + auto N_xformed = problem_shape.shape_C[RankT - 1]; + // shape_B: [K,T,R,S,C], K_xformed: [K,S,R,T] + auto K_xformed = insert<0>( + (reverse(take<1,RankT - 1>(problem_shape.shape_B))), + problem_shape.shape_B[0]); + + return make_shape(M_xformed, N_xformed, K_xformed, L_xformed); + } +} + +// Assuming im2col linearization +// Get problem shape MNKL according to following table: +// | | Fprop | Dgrad | Wgrad | +// | ---- | --------- | -------- | -------- | +// | Shape_M | (Q*P*Z*N) | ([W/V]*[H/U]*[D/O]*N) | (K) | +// | Shape_N | (K) | (C) | (C,S,R,T) | +// | Shape_K | (C,S,R,T) | (K,S,R,T) | (Q*P*Z*N) | +// | 
Shape_L | _1 | (V*U*O) | _1 | +template +CUTLASS_HOST_DEVICE +constexpr auto +get_linearized_problem_shape_MNKL(ConvProblemShape const& problem_shape) { + + auto [M, N, K, L] = get_transformed_problem_shape_MNKL(problem_shape); + + if constexpr (ConvOp == conv::Operator::kFprop || ConvOp == conv::Operator::kDgrad) { + return cute::make_shape(cute::product(M), N, K, cute::product(L)); + } + else if constexpr (ConvOp == conv::Operator::kWgrad) { + return cute::make_shape(M, N, cute::product(K), L); + } + +} + +//////////////////////////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::conv::detail + +//////////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/conv/device/conv_universal_adapter.hpp b/include/cutlass/conv/device/conv_universal_adapter.hpp index 0472b898c2..193f8d8854 100644 --- a/include/cutlass/conv/device/conv_universal_adapter.hpp +++ b/include/cutlass/conv/device/conv_universal_adapter.hpp @@ -61,7 +61,7 @@ template class ConvUniversalAdapter { public: - using ConvKernel = ConvKernel_; + using ConvKernel = GetUnderlyingKernel_t; using TileShape = typename ConvKernel::TileShape; using ElementA = typename ConvKernel::ElementA; using ElementB = typename ConvKernel::ElementB; @@ -76,7 +76,7 @@ class ConvUniversalAdapter // Tease out meta-information about the conv algorithm static constexpr conv::Operator kConvolutionalOperator = DispatchPolicy::ConvOp; - static constexpr int NumSpatialDimensions = ConvKernel::NumSpatialDimensions; + static constexpr int NumSpatialDimensions = CollectiveMainloop::NumSpatialDimensions; // If our TiledMMA's instruction thread layout size is larger than 1, we know its a tensorop! using OperatorClass = cute::conditional_t< @@ -121,13 +121,13 @@ class ConvUniversalAdapter static int constexpr kStages = CollectiveMainloop::DispatchPolicy::Stages; // Inspect TiledCopy for A and B to compute the alignment size - static int constexpr kAlignmentA = detail::get_alignment_count_from_gmem_tiled_copy< + static int constexpr kAlignmentA = cutlass::detail::get_alignment_count_from_gmem_tiled_copy< typename CollectiveMainloop::GmemTiledCopyA, ElementA>(); - static int constexpr kAlignmentB = detail::get_alignment_count_from_gmem_tiled_copy< + static int constexpr kAlignmentB = cutlass::detail::get_alignment_count_from_gmem_tiled_copy< typename CollectiveMainloop::GmemTiledCopyB, ElementB>(); - static int constexpr kAlignmentC = detail::get_alignment_count_from_gmem_tiled_copy< + static int constexpr kAlignmentC = cutlass::detail::get_alignment_count_from_gmem_tiled_copy< typename CollectiveEpilogue::GmemTiledCopyC, ElementC>(); - static int constexpr kAlignmentD = detail::get_alignment_count_from_gmem_tiled_copy< + static int constexpr kAlignmentD = cutlass::detail::get_alignment_count_from_gmem_tiled_copy< typename CollectiveEpilogue::GmemTiledCopyD, ElementD>(); using EpilogueOutputOp = typename CollectiveEpilogue::ThreadEpilogueOp; @@ -297,8 +297,9 @@ class ConvUniversalAdapter Status launch_result; // Use extended launch API only for mainloops that use it if constexpr (ConvKernel::ArchTag::kMinComputeCapability >= 90) { - constexpr bool is_static_1x1x1 = cute::is_static_v and - cute::size(typename ConvKernel::DispatchPolicy::ClusterShape{}) == 1; + [[maybe_unused]] constexpr bool is_static_1x1x1 = + cute::is_static_v and + cute::size(typename ConvKernel::DispatchPolicy::ClusterShape{}) == 1; dim3 cluster(cute::size<0>(typename 
ConvKernel::DispatchPolicy::ClusterShape{}), cute::size<1>(typename ConvKernel::DispatchPolicy::ClusterShape{}), cute::size<2>(typename ConvKernel::DispatchPolicy::ClusterShape{})); diff --git a/include/cutlass/conv/device/direct_convolution.h b/include/cutlass/conv/device/direct_convolution.h index 84953d8036..43ab94b5fc 100644 --- a/include/cutlass/conv/device/direct_convolution.h +++ b/include/cutlass/conv/device/direct_convolution.h @@ -211,6 +211,7 @@ class DirectConvolution { dim3 grid = ReorderKernel::get_grid_shape(params_); dim3 block = ReorderKernel::get_block_shape(); + cutlass::arch::synclog_setup(); cutlass::Kernel<<>>(params_); } @@ -229,6 +230,7 @@ class DirectConvolution { if (status != cudaSuccess) return Status::kErrorInternal; + cutlass::arch::synclog_setup(); cutlass::Kernel<<>>(params_); cudaError_t result = cudaGetLastError(); diff --git a/include/cutlass/conv/device/implicit_gemm_convolution.h b/include/cutlass/conv/device/implicit_gemm_convolution.h index 62c7e8715d..a1cb06e98f 100644 --- a/include/cutlass/conv/device/implicit_gemm_convolution.h +++ b/include/cutlass/conv/device/implicit_gemm_convolution.h @@ -53,7 +53,7 @@ template class ImplicitGemmConvolution { public: - using UnderlyingKernel = ImplicitGemmKernel_; + using UnderlyingKernel = GetUnderlyingKernel_t; using ElementA = typename UnderlyingKernel::ElementA; using LayoutA = typename UnderlyingKernel::LayoutA; @@ -103,7 +103,6 @@ class ImplicitGemmConvolution { /// Determines whether the Implicit GEMM can execute the given problem. static Status can_implement(Arguments const &args) { - // dispatch to iterators Status status = UnderlyingKernel::Mma::IteratorA::can_implement(args.problem_size); if (Status::kSuccess != status) { @@ -164,9 +163,8 @@ class ImplicitGemmConvolution { // check for unsupported problem sizes for strided dgrad / deconv implementation if ((kConvolutionalOperator == conv::Operator::kDgrad || kConvolutionalOperator == conv::Operator::kDeconv) && kStrideSupport == conv::StrideSupport::kStrided) { - // split-k (serial or parallel) is not supported for strided dgrad / deconv - if(args.problem_size.split_k_slices > 1) { + if(args.problem_size.split_k_slices > 1 && (args.problem_size.stride().at(args.problem_size.stride().max_dim_index()) > 1)) { return Status::kErrorNotSupported; } @@ -291,7 +289,7 @@ class ImplicitGemmConvolution { } /// Runs the kernel using initialized state. - Status run(cudaStream_t stream = nullptr, CudaHostAdapter *cuda_adapter = nullptr) { + Status run(cudaStream_t stream = nullptr, CudaHostAdapter *cuda_adapter = nullptr, int32_t kernel_index = 0) { ThreadblockSwizzle threadblock_swizzle; @@ -311,7 +309,7 @@ class ImplicitGemmConvolution { void* kernel_params[] = {¶ms_}; launch_result = cuda_adapter->launch( - grid, dim3(1,1,1), block, smem_size, stream, kernel_params, 0 + grid, dim3(1,1,1), block, smem_size, stream, kernel_params, kernel_index ); } else { @@ -319,6 +317,7 @@ class ImplicitGemmConvolution { } } else { + cutlass::arch::synclog_setup(); cutlass::Kernel<<>>(params_); } @@ -333,20 +332,20 @@ class ImplicitGemmConvolution { } /// Runs the kernel using initialized state. - Status operator()(cudaStream_t stream = nullptr, CudaHostAdapter *cuda_adapter = nullptr) { - return run(stream, cuda_adapter); + Status operator()(cudaStream_t stream = nullptr, CudaHostAdapter *cuda_adapter = nullptr, int32_t kernel_index = 0) { + return run(stream, cuda_adapter, kernel_index); } /// Runs the kernel using initialized state. 
Status operator()( Arguments const &args, void *workspace = nullptr, - cudaStream_t stream = nullptr, CudaHostAdapter *cuda_adapter = nullptr) { + cudaStream_t stream = nullptr, CudaHostAdapter *cuda_adapter = nullptr, int32_t kernel_index = 0) { Status status = initialize(args, workspace, stream, cuda_adapter); if (status == Status::kSuccess) { - status = run(stream, cuda_adapter); + status = run(stream, cuda_adapter, kernel_index); } return status; diff --git a/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h b/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h index 1eb0d5600e..265156cc5b 100644 --- a/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h +++ b/include/cutlass/conv/device/implicit_gemm_convolution_fusion.h @@ -231,6 +231,7 @@ class ImplicitGemmConvolutionFusion { int smem_size = int(sizeof(typename ImplicitGemmFusionKernel::SharedStorage)); + cutlass::arch::synclog_setup(); cutlass::Kernel<<>>(params_); cudaError_t result = cudaGetLastError(); diff --git a/include/cutlass/conv/dispatch_policy.hpp b/include/cutlass/conv/dispatch_policy.hpp index 039f4539c4..b8b5eb2bff 100644 --- a/include/cutlass/conv/dispatch_policy.hpp +++ b/include/cutlass/conv/dispatch_policy.hpp @@ -37,6 +37,8 @@ #include "cute/layout.hpp" #include "cute/numeric/integral_constant.hpp" +#include "cutlass/gemm/dispatch_policy.hpp" + ////////////////////////////////////////////////////////////////////////////// ////////////////////////////////////////////////////////////////////////////// @@ -48,7 +50,7 @@ namespace cutlass::conv { // // Policies for categorical dispatch of mainloop against kernel grid schedules // -struct KernelImplicitTmaWarpSpecializedSm90 { }; +struct KernelImplicitTmaWarpSpecializedSm90 : cutlass::gemm::KernelTmaWarpSpecialized { }; struct KernelImplicitTmaWarpSpecializedSm90Cooperative { }; struct KernelImplicitTmaWarpSpecializedSm90Pingpong { }; @@ -84,3 +86,5 @@ struct MainloopSm90TmaGmmaWarpSpecializedImplicitGemm { ////////////////////////////////////////////////////////////////////////////// } // namespace cutlass::conv + +////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/conv/kernel/conv_universal.hpp b/include/cutlass/conv/kernel/conv_universal.hpp index 9d98dc9d96..23ccea2f8f 100644 --- a/include/cutlass/conv/kernel/conv_universal.hpp +++ b/include/cutlass/conv/kernel/conv_universal.hpp @@ -30,6 +30,7 @@ **************************************************************************************************/ #pragma once +#include "cutlass/conv/convnd_problem_shape.hpp" #include "cutlass/detail/dependent_false.hpp" //////////////////////////////////////////////////////////////////////////////// @@ -43,6 +44,7 @@ namespace cutlass::conv::kernel { * a composition of a collective mainloop and a collective epilogue. 
**/ template < + class ProblemShape_, class CollectiveMainloop_, class CollectiveEpilogue_, class TileSchedulerTag_ = void, diff --git a/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp b/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp index 95780bf84e..657ac6b3ec 100644 --- a/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp +++ b/include/cutlass/conv/kernel/sm90_implicit_gemm_tma_warpspecialized.hpp @@ -37,9 +37,12 @@ #include "cute/tensor.hpp" #include "cute/arch/cluster_sm90.hpp" +#include "cutlass/conv/detail.hpp" #include "cutlass/conv/convolution.h" #include "cutlass/conv/dispatch_policy.hpp" +#include "cutlass/gemm/dispatch_policy.hpp" #include "cutlass/pipeline/sm90_pipeline.hpp" +#include "cutlass/gemm/kernel/gemm_universal.hpp" #include "cutlass/gemm/kernel/tile_scheduler.hpp" /////////////////////////////////////////////////////////////////////////////// @@ -49,365 +52,25 @@ namespace cutlass::conv::kernel { /////////////////////////////////////////////////////////////////////////////// template < + class ProblemShape_, class CollectiveMainloop_, class CollectiveEpilogue_, - class TileSchedulerTag_ + class TileScheduler_ > class ConvUniversal< + ProblemShape_, CollectiveMainloop_, CollectiveEpilogue_, - TileSchedulerTag_, - cute::enable_if_t>> -{ -public: - // - // Type Aliases - // - - // Mainloop derived types - using CollectiveMainloop = CollectiveMainloop_; - using TileShape = typename CollectiveMainloop::TileShape; - using TiledMma = typename CollectiveMainloop::TiledMma; - using ArchTag = typename CollectiveMainloop::ArchTag; - using ElementA = typename CollectiveMainloop::ElementA; - using StrideA = typename CollectiveMainloop::StrideA; - using ElementB = typename CollectiveMainloop::ElementB; - using StrideB = typename CollectiveMainloop::StrideB; - using DispatchPolicy = typename CollectiveMainloop::DispatchPolicy; - using ElementAccumulator = typename CollectiveMainloop::ElementAccumulator; - using ClusterShape = typename DispatchPolicy::ClusterShape; - using MainloopArguments = typename CollectiveMainloop::Arguments; - using MainloopParams = typename CollectiveMainloop::Params; - static constexpr int NumSpatialDimensions = CollectiveMainloop::NumSpatialDimensions; - static_assert(ArchTag::kMinComputeCapability >= 90); - // Epilogue derived types - using CollectiveEpilogue = CollectiveEpilogue_; - using ElementC = typename CollectiveEpilogue::ElementC; - using StrideC = typename CollectiveEpilogue::StrideC; - using ElementD = typename CollectiveEpilogue::ElementD; - using StrideD = typename CollectiveEpilogue::StrideD; - using EpilogueArguments = typename CollectiveEpilogue::Arguments; - using EpilogueParams = typename CollectiveEpilogue::Params; - - using TileSchedulerTag = TileSchedulerTag_; - static_assert(cute::is_void_v, - "TMA warp-specialized kernel does not support specializing the tile scheduler."); - using TileScheduler = typename cutlass::gemm::kernel::detail::TileSchedulerSelector< - TileSchedulerTag, ArchTag, TileShape, ClusterShape>::Scheduler; - using TileSchedulerArguments = typename TileScheduler::Arguments; - - // Kernel level shared memory storage - struct SharedStorage { - union TensorStorage { - using MainloopTensorStorage = typename CollectiveMainloop::TensorStorage; - using EpilogueTensorStorage = typename CollectiveEpilogue::TensorStorage; - - MainloopTensorStorage mainloop; - EpilogueTensorStorage epilogue; - } tensors; - - struct PipelineStorage : cute::aligned_struct<16> { - 
using MainloopPipelineStorage = typename CollectiveMainloop::PipelineStorage; - using EpiLoadPipelineStorage = typename CollectiveEpilogue::PipelineStorage; - - alignas(16) MainloopPipelineStorage mainloop; - alignas(16) EpiLoadPipelineStorage epi_load; - } pipelines; - }; - - static constexpr int SharedStorageSize = sizeof(SharedStorage); - static constexpr uint32_t NumLoadWarpGroups = 1; - static constexpr uint32_t NumMmaWarpGroups = 1; - static constexpr uint32_t MaxThreadsPerBlock = CUTE_STATIC_V(size(TiledMma{})) + (NumLoadWarpGroups * NumThreadsPerWarpGroup); - static constexpr uint32_t MinBlocksPerMultiprocessor = 1; - - // Host facing host arguments - struct Arguments { - MainloopArguments mainloop{}; - EpilogueArguments epilogue{}; - KernelHardwareInfo hw_info{}; - TileSchedulerArguments scheduler{}; - }; - - // Kernel device entry point API - struct Params { - MainloopParams mainloop; - EpilogueParams epilogue; - }; - - // - // Methods - // - - // Map user facing arguments to device facing params - static Params - to_underlying_arguments(Arguments const& args, void* workspace) { - (void) workspace; - auto mainloop_params = CollectiveMainloop::to_underlying_arguments(args.mainloop, workspace); - auto problem_shape_MNKL = args.mainloop.problem_shape.get_transformed_problem_shape_MNKL(); - - return { - mainloop_params, - CollectiveEpilogue::to_underlying_arguments(problem_shape_MNKL, args.epilogue, workspace) - }; - } - - // Given arguemnts, returns true if the kernel can successfully compute upon them. False otherwise. - static bool - can_implement(Arguments const& args) { - bool implementable = true; - implementable &= CollectiveMainloop::can_implement(args.mainloop.problem_shape, args.mainloop); - implementable &= CollectiveEpilogue::can_implement(args.mainloop.problem_shape.get_transformed_problem_shape_MNKL(), args.epilogue); - return implementable; - } - - static size_t - get_workspace_size(Arguments const& args) { - return 0; - } - - static cutlass::Status - initialize_workspace(Arguments const& args, void* workspace = nullptr, cudaStream_t stream = nullptr, - CudaHostAdapter* cuda_adapter = nullptr) { - return Status::kSuccess; - } - - // Computes the kernel launch grid shape based on runtime parameters - static dim3 - get_grid_shape(Params const& params) { - return cutlass::gemm::kernel::detail::PersistentTileSchedulerSm90::get_tiled_cta_shape_mnl( - params.mainloop.problem_shape, TileShape{}, ClusterShape{}); - } - - static dim3 - get_block_shape() { - return dim3(MaxThreadsPerBlock, 1, 1); - } - - CUTLASS_DEVICE - void - operator()(Params const& params, char* smem_buf) { - using namespace cute; - using X = Underscore; - - // Any Tensor Op MMA Atom in the WGMMA ISA is arch conditional to sm90a. - #if ! defined(__CUDA_ARCH_FEAT_SM90_ALL) - if constexpr(size<0>(typename TiledMma::AtomShape_MNK{}) == 64) { - printf("ERROR : Arch conditional MMA instruction used without targeting sm90a compute capability. 
Aborting.\n"); - return; - } - #endif - - enum class WarpGroupRole { - Producer = 0, - Consumer = 1, - }; - - enum class ProducerWarpRole { - MainloopEpilogue = 0, - Warp1 = 1, - Warp2 = 2, - Warp3 = 3 - }; - - // Kernel level shared memory storage - SharedStorage& shared_storage = *reinterpret_cast(smem_buf); - - int thread_idx = int(threadIdx.x); - int lane_idx = canonical_lane_idx(); - int warp_idx = canonical_warp_idx_sync(); - int warp_idx_in_warp_group = warp_idx % NumWarpsPerWarpGroup; - int warp_group_thread_idx = thread_idx % NumThreadsPerWarpGroup; - auto warp_group_role = WarpGroupRole(canonical_warp_group_idx()); - auto producer_warp_role = ProducerWarpRole(warp_idx_in_warp_group); - int lane_predicate = cute::elect_one_sync(); - uint32_t block_rank_in_cluster = cute::block_rank_in_cluster(); - - // Issue Tma Descriptor Prefetch from a single thread - if ((warp_idx == 0) && lane_predicate) { - CollectiveMainloop::prefetch_tma_descriptors(params.mainloop); - CollectiveEpilogue::prefetch_tma_descriptors(params.epilogue); - } - - // Mainloop Load pipeline - using MainloopPipeline = typename CollectiveMainloop::MainloopPipeline; - typename MainloopPipeline::Params mainloop_pipeline_params; - if (warp_group_role == WarpGroupRole::Producer && producer_warp_role == ProducerWarpRole::MainloopEpilogue) { - mainloop_pipeline_params.role = MainloopPipeline::ThreadCategory::Producer; - } - if (warp_group_role == WarpGroupRole::Consumer) { - mainloop_pipeline_params.role = MainloopPipeline::ThreadCategory::Consumer; - } - mainloop_pipeline_params.is_leader = warp_group_thread_idx == 0; - mainloop_pipeline_params.num_consumers = NumThreadsPerWarpGroup; - mainloop_pipeline_params.transaction_bytes = params.mainloop.tma_transaction_bytes; - MainloopPipeline mainloop_pipeline(shared_storage.pipelines.mainloop, mainloop_pipeline_params, ClusterShape{}); - - // Epilogue Load pipeline - using EpiLoadPipeline = typename CollectiveEpilogue::LoadPipeline; - typename EpiLoadPipeline::Params epi_load_pipeline_params; - if (warp_group_role == WarpGroupRole::Producer && producer_warp_role == ProducerWarpRole::MainloopEpilogue) { - epi_load_pipeline_params.role = EpiLoadPipeline::ThreadCategory::Producer; - } - if (warp_group_role == WarpGroupRole::Consumer) { - epi_load_pipeline_params.role = EpiLoadPipeline::ThreadCategory::Consumer; - } - epi_load_pipeline_params.dst_blockid = cute::block_rank_in_cluster(); - epi_load_pipeline_params.producer_arv_count = NumThreadsPerWarp; - epi_load_pipeline_params.consumer_arv_count = NumThreadsPerWarpGroup; - if constexpr (CollectiveEpilogue::RequiresTransactionBytes) { - epi_load_pipeline_params.transaction_bytes = params.epilogue.tma_transaction_bytes; - } - EpiLoadPipeline epi_load_pipeline(shared_storage.pipelines.epi_load, epi_load_pipeline_params); - - // Epilogue Store pipeline - using EpiStorePipeline = typename CollectiveEpilogue::StorePipeline; - typename EpiStorePipeline::Params epi_store_pipeline_params; - epi_store_pipeline_params.always_wait = true; - EpiStorePipeline epi_store_pipeline(epi_store_pipeline_params); - - // Initialize starting pipeline states for the collectives - // Epilogue store pipe is producer-only (consumer is TMA unit, waits via scoreboarding) - typename CollectiveMainloop::PipelineState mainloop_pipe_consumer_state; - typename CollectiveEpilogue::LoadPipelineState epi_load_pipe_consumer_state; - - // For the DMA Load (producer) we start with an opposite phase - // i.e., we skip all waits since we know that the buffer is indeed 
empty - PipelineState mainloop_pipe_producer_state = cutlass::make_producer_start_state(); - PipelineState epi_load_pipe_producer_state = cutlass::make_producer_start_state(); - PipelineState epi_store_pipe_producer_state = cutlass::make_producer_start_state(); - - auto cluster_wait_fn = [&] () { - // We need this to guarantee that the Pipeline init is visible - // To all producers and consumer thread blocks in the Cluster - if constexpr (size(ClusterShape{}) > 1) { - cute::cluster_arrive_relaxed(); - return [] () { cute::cluster_wait(); }; - } - else { - __syncthreads(); - return [] () {}; // do nothing - } - } (); - - // Separate out problem shape for convenience - auto problem_shape_MNKL = append<4>(params.mainloop.problem_shape, _1{}); - auto [M, N, K, L] = problem_shape_MNKL; - - // TMA requires special handling of strides to deal with coord codomain mapping - // Represent the full tensors -- get these from TMA - Tensor mA_mk = params.mainloop.tma_load_a.get_tma_tensor(make_shape(M, K)); - Tensor mB_nk = params.mainloop.tma_load_b.get_tma_tensor(make_shape(N, K)); - - // Get the appropriate blocks for this thread block -- potential for thread block locality - auto cta_tile_shape = TileShape{}; // (BLK_M,BLK_N,BLK_K) - TiledMma tiled_mma; - - // Make tiled views, defer the slice - Tensor gA_mk = local_tile(mA_mk, cta_tile_shape, make_coord(_,_,_), Step<_1, X,_1>{}); // (BLK_M,BLK_K,m,k) - Tensor gB_nk = local_tile(mB_nk, cta_tile_shape, make_coord(_,_,_), Step< X,_1,_1>{}); // (BLK_N,BLK_K,n,k) - - // Compute m_coord, n_coord, and l_coord with their post-tiled shapes - auto m_coord = idx2crd(int(blockIdx.x), shape<2>(gA_mk)); - auto n_coord = idx2crd(int(blockIdx.y), shape<2>(gB_nk), compact_col_major(shape<2>(gB_nk))); - - // The output shape M is linearized so the output coord M here should also be linearized. 
- auto output_tile_coord = make_coord(int(blockIdx.x), n_coord, _, Int<0>{}); - - // Slice with m_coord and n_coord - Tensor gA = gA_mk(_,_,m_coord,_); // (BLK_M,BLK_K,k) - Tensor gB = gB_nk(_,_,n_coord,_); // (BLK_N,BLK_K,k) - - // Get pipeline iterators and increments from tensor shapes - auto k_tile_iter = cute::make_coord_iterator(shape<2>(gA)); - auto k_tile_count = size<2>(gA); - - // In a warp specialized kernel, collectives expose data movement and compute operations separately - CollectiveMainloop collective_mainloop; - CollectiveEpilogue collective_epilogue{params.epilogue, shared_storage.tensors.epilogue}; - - // Wait for all thread blocks in Cluster - cluster_wait_fn(); - - if (warp_group_role == WarpGroupRole::Producer) { - if (producer_warp_role == ProducerWarpRole::MainloopEpilogue) { - collective_mainloop.load( - mainloop_pipeline, - mainloop_pipe_producer_state, - gA, params.mainloop.tma_load_a, - gB, params.mainloop.tma_load_b, - k_tile_iter, k_tile_count, - lane_idx, - block_rank_in_cluster, - shared_storage.tensors.mainloop - ); - // Update starting mainloop pipeline state for the pipeline drain - mainloop_pipe_producer_state.advance(k_tile_count); - // Make sure mainloop consumer has been waited upon before issuing epilogue load - collective_mainloop.load_tail(mainloop_pipeline, mainloop_pipe_producer_state); - - if (collective_epilogue.is_producer_load_needed()) { - epi_load_pipe_producer_state = collective_epilogue.load( - epi_load_pipeline, - epi_load_pipe_producer_state, - problem_shape_MNKL, - cta_tile_shape, - output_tile_coord, - tiled_mma, - lane_idx, - shared_storage.tensors.epilogue - ); - collective_epilogue.load_tail(epi_load_pipeline, epi_load_pipe_producer_state); - } - } - } - else if (warp_group_role == WarpGroupRole::Consumer) { - Tensor accumulators = partition_fragment_C(tiled_mma, take<0,2>(cta_tile_shape)); // (MMA,MMA_M,MMA_N) - - collective_mainloop.mma( - mainloop_pipeline, - mainloop_pipe_consumer_state, - accumulators, - k_tile_count, - thread_idx, - shared_storage.tensors.mainloop, - params.mainloop - ); - - // Make sure the math instructions are done and free buffers before entering the epilogue - collective_mainloop.mma_tail( - mainloop_pipeline, - mainloop_pipe_consumer_state, - k_tile_count - ); - - // Epilogue and write to gD - auto [epi_load_pipe_consumer_state_next, epi_store_pipe_producer_state_next] = - collective_epilogue.store( - epi_load_pipeline, - epi_load_pipe_consumer_state, - epi_store_pipeline, - epi_store_pipe_producer_state, - problem_shape_MNKL, - cta_tile_shape, - output_tile_coord, - accumulators, - tiled_mma, - warp_group_thread_idx, - shared_storage.tensors.epilogue - ); - - collective_epilogue.store_tail( - epi_load_pipeline, - epi_load_pipe_consumer_state_next, - epi_store_pipeline, - epi_store_pipe_producer_state_next - ); - } - } -}; - + TileScheduler_, + cute::enable_if_t> +> : public cutlass::gemm::kernel::GemmUniversal< + ProblemShape_, + CollectiveMainloop_, + CollectiveEpilogue_, + TileScheduler_ +> +{}; /////////////////////////////////////////////////////////////////////////////// } // namespace cutlass::conv::kernel + diff --git a/include/cutlass/cuda_host_adapter.hpp b/include/cutlass/cuda_host_adapter.hpp index 63dbc93807..2c5f61d6ed 100644 --- a/include/cutlass/cuda_host_adapter.hpp +++ b/include/cutlass/cuda_host_adapter.hpp @@ -82,6 +82,80 @@ namespace cutlass { +///////////////////////////////////////////////////////////////////////////////////////////////// + + +#if !defined(__CUDACC_RTC__) && 
!defined(CUTLASS_ENABLE_SYCL)
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#define CUTLASS_CUDA_DRIVER_STRINGIFY(tok) #tok
+
+#if defined(CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL)
+
+#define CUTLASS_CUDA_DRIVER_WRAPPER_DECL(func, ver)          \
+  template <typename... Args>                                \
+  CUresult call_##func(Args... args) {                       \
+    return func(args...);                                    \
+  }
+
+#else // defined(CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL)
+
+#if (__CUDACC_VER_MAJOR__ >= 12 && __CUDACC_VER_MINOR__ >= 5)
+
+#define CUTLASS_CUDA_DRIVER_WRAPPER_DECL(func, ver)          \
+  template <typename... Args>                                \
+  CUresult call_##func(Args... args) {                       \
+    cudaDriverEntryPointQueryResult cuda_status;             \
+    void* pfn = nullptr;                                     \
+    cudaError_t cuda_err = cudaGetDriverEntryPointByVersion( \
+      CUTLASS_CUDA_DRIVER_STRINGIFY(func),                   \
+      &pfn, ver,                                             \
+      cudaEnableDefault,                                     \
+      &cuda_status);                                         \
+    if (cuda_status != cudaDriverEntryPointSuccess ||        \
+        cuda_err != cudaSuccess) {                           \
+      return CUDA_ERROR_UNKNOWN;                             \
+    }                                                        \
+    return reinterpret_cast<CUresult (*)(Args...)>(pfn)(args...); \
+  }
+
+#else
+
+#define CUTLASS_CUDA_DRIVER_WRAPPER_DECL(func, ver)          \
+  template <typename... Args>                                \
+  CUresult call_##func(Args... args) {                       \
+    cudaDriverEntryPointQueryResult cuda_status;             \
+    void* pfn = nullptr;                                     \
+    cudaError_t cuda_err = cudaGetDriverEntryPoint(          \
+      CUTLASS_CUDA_DRIVER_STRINGIFY(func),                   \
+      &pfn,                                                  \
+      cudaEnableDefault,                                     \
+      &cuda_status);                                         \
+    if (cuda_status != cudaDriverEntryPointSuccess ||        \
+        cuda_err != cudaSuccess) {                           \
+      return CUDA_ERROR_UNKNOWN;                             \
+    }                                                        \
+    return reinterpret_cast<CUresult (*)(Args...)>(pfn)(args...); \
+  }
+
+#endif // (__CUDACC_VER_MAJOR__ >= 12 && __CUDACC_VER_MINOR__ >= 5)
+
+#endif // defined(CUTLASS_ENABLE_DIRECT_CUDA_DRIVER_CALL)
+
+#if (__CUDACC_VER_MAJOR__ >= 12)
+CUTLASS_CUDA_DRIVER_WRAPPER_DECL(cuTensorMapEncodeTiled, 12000);
+CUTLASS_CUDA_DRIVER_WRAPPER_DECL(cuTensorMapEncodeIm2col, 12000);
+#endif
+
+#undef CUTLASS_CUDA_DRIVER_STRINGIFY
+
+#define CUTLASS_CUDA_DRIVER_WRAPPER_CALL(func) cutlass::call_##func
+
+#endif // !defined(__CUDACC_RTC__)
+
+
 /////////////////////////////////////////////////////////////////////////////////////////////////
 
 /// This class manages runtime CUlaunchAttribute that can be supplied to CudaHostAdapter
diff --git a/include/cutlass/cutlass.h b/include/cutlass/cutlass.h
index fbf6276f90..84b0455e31 100644
--- a/include/cutlass/cutlass.h
+++ b/include/cutlass/cutlass.h
@@ -35,6 +35,7 @@
 
 #pragma once
 
+#include "cutlass/arch/synclog.hpp"
 #include "cutlass/detail/helper_macros.hpp"
 #include
 
diff --git a/include/cutlass/detail/collective.hpp b/include/cutlass/detail/collective.hpp
index d3c4c04b74..a4b288e7c9 100644
--- a/include/cutlass/detail/collective.hpp
+++ b/include/cutlass/detail/collective.hpp
@@ -31,7 +31,6 @@
 #pragma once
 
 #include "cute/container/tuple.hpp"
-
 /////////////////////////////////////////////////////////////////////////////////////////////////
 
 namespace cutlass::gemm::collective {
 
diff --git a/include/cutlass/detail/helper_macros.hpp b/include/cutlass/detail/helper_macros.hpp
index 96d259d5eb..280c63939a 100644
--- a/include/cutlass/detail/helper_macros.hpp
+++ b/include/cutlass/detail/helper_macros.hpp
@@ -104,6 +104,44 @@ CUTLASS_HOST_DEVICE void __CUTLASS_UNUSED(T const &)
 #endif
 #endif
 
+// CUTLASS_CMATH_NAMESPACE is the namespace where code can find
+// functions like isnan and log. Such functions are in
+// the std namespace in host code, but in the global namespace
+// in device code.
+//
+// The intended use case for this macro is in "using" declarations
+// for making argument-dependent lookup (ADL) work in generic code.
+// For example, if T is cutlass::half_t, the following code will
+// invoke cutlass::isnan(half_t). If T is float, it will invoke
+// std::isnan on host and ::isnan on device. (CUTLASS's support
+// for NVRTC prevents it from using things in the std namespace
+// in device code.) Correct use of "using" declarations can help
+// avoid unexpected implicit conversions, like from half_t to float.
+//
+// template <class T>
+// bool foo(T x) {
+//   using CUTLASS_CMATH_NAMESPACE :: isnan;
+//   return isnan(x);
+// }
+//
+// Without this macro, one would need to write the following.
+//
+// template <class T>
+// bool foo(T x) {
+// #if defined(__CUDA_ARCH__)
+//   using ::isnan;
+// #else
+//   using std::isnan;
+// #endif
+//   return isnan(x);
+// }
+
+#if defined(__CUDA_ARCH__)
+#  define CUTLASS_CMATH_NAMESPACE
+#else
+#  define CUTLASS_CMATH_NAMESPACE std
+#endif
+
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
 namespace cutlass {
 
diff --git a/include/cutlass/detail/layout.hpp b/include/cutlass/detail/layout.hpp
index 429e5c2f06..cbed61f683 100644
--- a/include/cutlass/detail/layout.hpp
+++ b/include/cutlass/detail/layout.hpp
@@ -30,13 +30,17 @@
 **************************************************************************************************/
 #pragma once
 
+#include "cute/layout.hpp"
+#include "cute/pointer_sparse.hpp"   // cute::is_sparse
+#include "cute/swizzle.hpp"          // cute::Swizzle
+#include "cute/swizzle_layout.hpp"   // cute::detail::get_swizzle_portion
+#include "cute/util/type_traits.hpp"
+#include "cute/arch/copy_sm90_tma.hpp"
 #include "cutlass/layout/matrix.h"
 #include "cutlass/layout/tensor.h"
 #include "cutlass/numeric_types.h"
+#include "cutlass/detail/collective.hpp"
 
-#include "cute/layout.hpp"
-#include "cute/util/type_traits.hpp"
-#include "cute/arch/copy_sm90_tma.hpp"
 ////////////////////////////////////////////////////////////////////////////////////////////////////
 
 namespace cutlass::detail {
 
@@ -194,12 +198,28 @@ is_major(Stride = {}) {
   return cute::is_constant<1, decltype(cute::front(cute::get<B>(cute::remove_pointer_t<Stride>{})))>::value;
 }
 
+template <int B, class Shape, class Stride>
+constexpr bool
+is_major(cute::Layout<Shape, Stride> = {}) {
+  return is_major<B>(Stride{});
+}
+
 // Note : This method can be used for deducing the Layout Tag of A, C, D Matrices
 template <class StrideA>
 constexpr
 auto
 stride_to_layout_tag_A() {
-  if constexpr (is_major<0, StrideA>()) { // M major
+  using InternalStrideA = cute::remove_pointer_t<StrideA>;
+  if constexpr (cute::is_layout<InternalStrideA>::value) {
+    return stride_to_layout_tag_A<decltype(InternalStrideA{}.stride())>();
+  }
+  else if constexpr (is_major<0, StrideA>()) { // M major
+    return layout::ColumnMajor{};
+  }
+  // Specialize for sparse layout
+  else if constexpr (cute::get<0>(InternalStrideA{}) == cute::_2{} &&
+                     cute::rank(cute::get<1>(InternalStrideA{})) == 2 &&
+                     cute::is_same_v<cute::_1, cute::remove_cvref_t<decltype(cute::get<1,0>(InternalStrideA{}))>>) {
     return layout::ColumnMajor{};
   }
   else { // K major
@@ -213,7 +233,11 @@
 template <class StrideB>
 constexpr
 auto
 stride_to_layout_tag_B() {
-  if constexpr (is_major<0, StrideB>()) { // N major
+  using InternalStrideB = cute::remove_pointer_t<StrideB>;
+  if constexpr (cute::is_layout<InternalStrideB>::value) {
+    return stride_to_layout_tag_B<decltype(InternalStrideB{}.stride())>();
+  }
+  else if constexpr (is_major<0, StrideB>()) { // N major
     return layout::RowMajor{};
   }
   else { // K major
@@ -227,7 +251,11 @@
 template <class StrideC>
 constexpr
 auto
 stride_to_layout_tag_C() {
-  if constexpr (is_major<0, StrideC>()) { // M major
+  using InternalStrideC = cute::remove_pointer_t<StrideC>;
+  if constexpr (cute::is_layout<InternalStrideC>::value) {
+    return stride_to_layout_tag_C<decltype(InternalStrideC{}.stride())>();
+  }
+  else if constexpr (is_major<0, StrideC>()) { // M major
     return layout::ColumnMajor{};
   }
   else { // N major
@@ -309,6 +337,10 @@ get_alignment_count_from_gmem_tiled_copy() {
   else {
     // For TMA tiled copies, we know the alignment has to be 128 bits
     if constexpr (is_tma_copy_engine<GmemTiledCopy>()) {
+      // For sparse MMA, alignment in logical elements is increased by sparsity factor
+      if constexpr (cute::is_sparse_v<ElementMma>) {
+        return 128 / sizeof_bits<Element>::value * ElementMma::sparsity;
+      }
       return 128 / sizeof_bits<Element>::value;
     }
     else {
@@ -334,29 +366,26 @@ get_output_alignment_bits() {
   return 128;
 }
 
-
-// Return the shape that is associated with stride-1 mode, or 1 if not found
-template <class Shape, class Stride>
+// Check if tensor layout satisfies a given major alignment
+template <int Alignment, class Shape, class Stride>
 CUTLASS_HOST_DEVICE constexpr
-auto
-get_contiguous_shape(Shape const & shape, Stride const & stride) {
-  using namespace cute;
-  auto idx = find_if(append(flatten(stride), _1{}), [](auto s){ return is_constant<1,decltype(s)>{}; });
-  return get<decltype(idx)::value>(append(flatten(shape), _1{}));
+bool
+check_alignment(cute::Layout<Shape, Stride> const& layout) {
+  // Condition: shape must divide by Alignment without rounding
+  bool shape_check = cute::size(layout.shape()) == Alignment * cute::size(cute::upcast<Alignment>(layout));
+  // Condition: every dynamic stride must be a multiple of Alignment
+  bool stride_check = cute::all_of(cute::flatten(layout.stride()), [](auto s){ return cute::is_static<decltype(s)>::value || (s % Alignment == 0); });
+  return shape_check && stride_check;
 }
 
-// Check if tensor shape satisfies a given major alignment
+// Check if tensor layout satisfies a given major alignment
 template <int Alignment, class Shape, class Stride>
 CUTLASS_HOST_DEVICE constexpr
 bool
-check_alignment(Shape const & shape, Stride const & stride) {
-  return is_major<0>(stride)
-    ? get_contiguous_shape(cute::get<0>(shape), cute::get<0>(stride)) % Alignment == 0
-    : get_contiguous_shape(cute::get<1>(shape), cute::get<1>(stride)) % Alignment == 0;
+check_alignment(Shape const& shape, Stride const& stride) {
+  return check_alignment<Alignment>(cute::make_layout(shape, stride));
 }
 
-// Check if tensor shape satisfies a given major alignment
-
 template
 CUTLASS_HOST_DEVICE constexpr
 size_t
diff --git a/include/cutlass/detail/mma.hpp b/include/cutlass/detail/mma.hpp
index 058f5fd3ea..0e491b9c40 100644
--- a/include/cutlass/detail/mma.hpp
+++ b/include/cutlass/detail/mma.hpp
@@ -42,6 +42,11 @@ namespace cutlass::detail {
 template <class TiledMma, class Enable = void>
 struct IsSparseTensorOp : cute::false_type { };
 
+// TiledMma for sparse must have ValTypeE
+template <class TiledMma>
+struct IsSparseTensorOp<TiledMma, cute::void_t<typename TiledMma::ValTypeE>>
+  : cute::true_type { };
+
 // The following metafunction is used to extract the OperatorClass from a cutlass 3.x kernel.
 template
 struct get_operator_class {
diff --git a/include/cutlass/device_kernel.h b/include/cutlass/device_kernel.h
index c45c06dd07..8670246e34 100644
--- a/include/cutlass/device_kernel.h
+++ b/include/cutlass/device_kernel.h
@@ -56,6 +56,13 @@ namespace cutlass {
 
+template <typename T> struct Type2Type { using type = T; };
+// Use the simple type in place of the complex type to reduce symbol size
+template <typename T> struct GetUnderlyingKernel : public Type2Type<T> {};
+template <typename T, template <typename> class Wrapper > struct GetUnderlyingKernel<Wrapper<T>> : public Wrapper<T> {};
+template <typename T> using GetUnderlyingKernel_t = typename GetUnderlyingKernel<T>::type;
+
+
 ////////////////////////////////////////////////////////////////////////////////
 
 /// Generic CUTLASS kernel template.
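To make the intent of the `GetUnderlyingKernel` helper concrete, here is a minimal, self-contained sketch of the pattern; `ShortAlias` and `SomeVeryLongKernelName` are hypothetical stand-ins, not CUTLASS types:

#include <type_traits>

template <typename T> struct Type2Type { using type = T; };
template <typename T> struct GetUnderlyingKernel : Type2Type<T> {};
template <typename T, template <typename> class Wrapper>
struct GetUnderlyingKernel<Wrapper<T>> : Wrapper<T> {};
template <typename T>
using GetUnderlyingKernel_t = typename GetUnderlyingKernel<T>::type;

// A verbose kernel type, and a short wrapper that exposes it via ::type.
struct SomeVeryLongKernelName {};
template <typename K> struct ShortAlias : Type2Type<K> {};

// The short wrapper can appear in the launch plumbing (and thus in mangled
// symbol names) while the underlying kernel type stays recoverable:
static_assert(std::is_same<GetUnderlyingKernel_t<ShortAlias<SomeVeryLongKernelName>>,
                           SomeVeryLongKernelName>::value, "");
static_assert(std::is_same<GetUnderlyingKernel_t<SomeVeryLongKernelName>,
                           SomeVeryLongKernelName>::value, "");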
@@ -77,6 +84,7 @@ void Kernel(typename Operator::Params params) {
   Operator op;
 
   op(params, *shared_storage);
+  cutlass::arch::synclog_print();
 }
 
 
@@ -97,6 +105,8 @@ void Kernel2(typename Operator::Params params) {
       reinterpret_cast<typename Operator::SharedStorage *>(SharedStorageBase);
 
   Operator::invoke(params, *shared_storage);
+  cutlass::arch::synclog_print();
+
 }
 
 
@@ -123,6 +133,8 @@ void device_kernel(CUTLASS_GRID_CONSTANT typename Operator::Params const params)
 #endif
   Operator op;
   op(params, smem);
+  cutlass::arch::synclog_print();
+
 }
 
 ////////////////////////////////////////////////////////////////////////////////
 
diff --git a/include/cutlass/epilogue/collective/builders/sm90_builder.inl b/include/cutlass/epilogue/collective/builders/sm90_builder.inl
index 2ca62c9794..759591b5dc 100644
--- a/include/cutlass/epilogue/collective/builders/sm90_builder.inl
+++ b/include/cutlass/epilogue/collective/builders/sm90_builder.inl
@@ -71,14 +71,18 @@ sm90_get_tma_dispatch_policy() {
   // 8b residuals load fast and consume little smem, so the perf cost of waiting on stores to finish outweighs the cost of extra allocation
   constexpr bool ReuseSmem = (sizeof_bits_v<ElementC> == sizeof_bits_v<ElementD>) && (sizeof_bits_v<ElementD> > 8);
   // TMA store delay performs worse with residual loads and complicates tensormap updates for Ptr-Array GEMMs
-  constexpr bool DelayTmaStore = is_void_v<ElementC> && !detail::sm90_is_tma_ptr_array_v<Schedule>;
+  constexpr bool DelayTmaStore = is_void_v<ElementC> && !detail::sm90_is_ptr_array_tma_v<Schedule>;
   constexpr int StagesD = cute::min(EpiTiles, 2);
   constexpr int StagesC = ReuseSmem ? cute::max(cute::min(EpiTiles, 4), StagesD+1)
                                     : cute::min(EpiTiles, 4);
 
-  return cute::conditional_t<detail::sm90_is_tma_ptr_array_v<Schedule>,
-    Sm90PtrArrayTmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmem, DelayTmaStore>,
-    Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmem, DelayTmaStore>>{};
+  if constexpr (detail::sm90_is_ptr_array_tma_v<Schedule>) {
+    return Sm90PtrArrayTmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmem, DelayTmaStore>{};
+  }
+  else {
+    return Sm90TmaWarpSpecialized<StagesC, StagesD, FragmentSize, ReuseSmem, DelayTmaStore>{};
+  }
 }
 
 // Returns the smem layout atom to be used for C or D matrix
@@ -254,6 +258,9 @@ struct Sm90TmaBuilderImpl {
   using GmemStrideTypeC = cutlass::detail::TagToStrideC_t<GmemLayoutTagC>;
   using GmemStrideTypeD = cutlass::detail::TagToStrideC_t<GmemLayoutTagD>;
+
+  using UnderlyingGmemStrideTypeC = cute::remove_pointer_t<GmemStrideTypeC>;
+  using UnderlyingGmemStrideTypeD = cute::remove_pointer_t<GmemStrideTypeD>;
 
   using CopyOpS2G = cute::conditional_t<detail::is_im2col_mode<GmemLayoutTagD>,
       SM90_TMA_STORE_IM2COL,
       SM90_TMA_STORE
@@ -266,18 +273,15 @@ struct Sm90TmaBuilderImpl {
 
   // Get the smallest tiled copy we can use to retile the accumulators
   using CopyAtomC = Copy_Atom<SM90_U32x4_STSM_N, cutlass::half_t>;
-
-  using FusionDispatchPolicy = Sm90TmaWarpSpecialized<DispatchPolicy::StagesC, DispatchPolicy::StagesD, DispatchPolicy::FragmentSize, DispatchPolicy::ReuseSmemC, DispatchPolicy::DelayTmaStore>;
+  // Get the register-to-register tiled copy that happens before the shared memory store.
+  // Apply void as no register transform op is needed currently.
+  using CopyOpR2R = void;
 
   // TMA builder allows for passing callbacks directly, which is either a fusion::FusionCallbacks
   // instance or a direct visitor implementation, e.g.
fusion::Sm90LinearCombination using FusionCallbacks = typename CallbacksBuilder< - FusionDispatchPolicy, + DispatchPolicy, FusionOpOrCallbacks, TileShape_MNK, EpilogueTile_MN, @@ -294,12 +298,13 @@ struct Sm90TmaBuilderImpl { GmemStrideTypeD, FusionCallbacks, CopyOpG2S, - decltype(detail::sm90_get_epilogue_smem_swizzle_layout_atom()), - decltype(detail::sm90_get_smem_load_op_for_source()), + decltype(detail::sm90_get_epilogue_smem_swizzle_layout_atom()), + decltype(detail::sm90_get_smem_load_op_for_source()), CopyOpS2G, - decltype(detail::sm90_get_epilogue_smem_swizzle_layout_atom()), - decltype(detail::sm90_get_smem_store_op_for_accumulator()), - CopyAtomC + decltype(detail::sm90_get_epilogue_smem_swizzle_layout_atom()), + decltype(detail::sm90_get_smem_store_op_for_accumulator()), + CopyAtomC, + CopyOpR2R >; }; @@ -385,6 +390,7 @@ struct AuxStoreDescriptor { // No-smem builder template < + class OpClass, class TileShape_MNK, class ClusterShape_MNK, class EpilogueTileType, @@ -401,7 +407,7 @@ template < > struct CollectiveBuilder< arch::Sm90, - arch::OpClassTensorOp, + OpClass, TileShape_MNK, ClusterShape_MNK, EpilogueTileType, @@ -451,6 +457,7 @@ struct CollectiveBuilder< // Tma warp-specialized builder template < + class OpClass, class TileShape_MNK, class ClusterShape_MNK, class EpilogueTileType, @@ -467,7 +474,7 @@ template < > struct CollectiveBuilder< arch::Sm90, - arch::OpClassTensorOp, + OpClass, TileShape_MNK, ClusterShape_MNK, EpilogueTileType, @@ -483,7 +490,7 @@ struct CollectiveBuilder< FusionOperation, cute::enable_if_t || cute::is_same_v || - cute::is_same_v >> { + detail::sm90_is_ptr_array_tma_v>> { private: using ElementD = cute::conditional_t, fusion::get_element_aux_t, ElementD_>; @@ -512,6 +519,7 @@ public: // Auto builder template < + class OpClass, class TileShape_MNK, class ClusterShape_MNK, class EpilogueTileType, @@ -527,7 +535,7 @@ template < > struct CollectiveBuilder< arch::Sm90, - arch::OpClassTensorOp, + OpClass, TileShape_MNK, ClusterShape_MNK, EpilogueTileType, @@ -551,7 +559,7 @@ private: using EpilogueSchedule = NoSmemWarpSpecialized; using _CollectiveBuilder = CollectiveBuilder< arch::Sm90, - arch::OpClassTensorOp, + OpClass, TileShape_MNK, ClusterShape_MNK, EpilogueTileType, @@ -573,6 +581,7 @@ public: // DEPRECATED Tma warp-specialized builder for elementwise fusion template < + class OpClass, class TileShape_MNK, class ClusterShape_MNK, class EpilogueTileType, @@ -590,7 +599,7 @@ template < struct [[deprecated("Use TmaWarpSpecialized with fusion::LinCombEltAct instead")]] CollectiveBuilder< arch::Sm90, - arch::OpClassTensorOp, + OpClass, TileShape_MNK, ClusterShape_MNK, EpilogueTileType, @@ -617,7 +626,7 @@ public: using CollectiveOp = typename CollectiveBuilder< arch::Sm90, - arch::OpClassTensorOp, + OpClass, TileShape_MNK, ClusterShape_MNK, EpilogueTileType, @@ -636,6 +645,7 @@ public: // DEPRECATED Tma warp-specialized builder for bias + elementwise fusion template < + class OpClass, class TileShape_MNK, class ClusterShape_MNK, class EpilogueTileType, @@ -653,7 +663,7 @@ template < struct [[deprecated("Use TmaWarpSpecialized with fusion::LinCombPerRowBiasEltAct or fusion::LinCombPerRowBiasEltActAux instead")]] CollectiveBuilder< arch::Sm90, - arch::OpClassTensorOp, + OpClass, TileShape_MNK, ClusterShape_MNK, EpilogueTileType, @@ -713,6 +723,9 @@ private: // Get the smallest tiled copy we can use to retile the accumulators using CopyAtomC = Copy_Atom; + // Get register to register tiled copy that happen before shared memory store. 
+ // Apply void as no register transform op needed. + using CopyOpR2R = void; public: using CollectiveOp = cutlass::epilogue::collective::Sm90EpilogueTmaWarpSpecializedBiasElementwise< @@ -732,7 +745,8 @@ public: SM90_TMA_STORE, decltype(detail::sm90_get_epilogue_smem_swizzle_layout_atom()), decltype(detail::sm90_get_smem_store_op_for_accumulator()), - CopyAtomC + CopyAtomC, + CopyOpR2R >; }; @@ -740,6 +754,7 @@ public: // since swapping NNN kernels input matrix and transposing its output at the same time then // we can get TTN kernel. template < + class OpClass, class TileShape_MNK, class ClusterShape_MNK, class EpilogueTileType, @@ -755,7 +770,7 @@ template < > struct CollectiveBuilder< arch::Sm90, - arch::OpClassTensorOp, + OpClass, TileShape_MNK, ClusterShape_MNK, EpilogueTileType, diff --git a/include/cutlass/epilogue/collective/collective_builder.hpp b/include/cutlass/epilogue/collective/collective_builder.hpp index becb1fb824..8ee169024a 100644 --- a/include/cutlass/epilogue/collective/collective_builder.hpp +++ b/include/cutlass/epilogue/collective/collective_builder.hpp @@ -30,6 +30,9 @@ **************************************************************************************************/ #pragma once +#include // cute::DefaultCopy +#include // cute::is_base_of_v + #include "cutlass/detail/dependent_false.hpp" #include "cutlass/epilogue/fusion/callbacks.hpp" @@ -100,7 +103,7 @@ struct CallbacksBuilder< TileShape_MNK, EpilogueTile_MN, ElementAccumulator, - cute::enable_if_t> + cute::enable_if_t> > { using Callbacks = FusionCallbacks; }; diff --git a/include/cutlass/epilogue/collective/collective_epilogue.hpp b/include/cutlass/epilogue/collective/collective_epilogue.hpp index d939f3799b..27db871e74 100644 --- a/include/cutlass/epilogue/collective/collective_epilogue.hpp +++ b/include/cutlass/epilogue/collective/collective_epilogue.hpp @@ -53,14 +53,22 @@ class CollectiveEpilogue { ///////////////////////////////////////////////////////////////////////////////////////////////// #include "detail.hpp" + +// +// Gemm +// #include "default_epilogue.hpp" #include "default_epilogue_array.hpp" #include "epilogue_tensor_broadcast.hpp" #include "sm70_epilogue_vectorized.hpp" +#include "sm70_epilogue_vectorized_array.hpp" #include "sm90_epilogue_tma_warpspecialized.hpp" #include "sm90_epilogue_tma_warpspecialized_bias_elementwise.hpp" #include "sm90_epilogue_array_tma_warpspecialized.hpp" #if defined (SYCL_INTEL_TARGET) #include "xe_epilogue.hpp" #endif +// +// Conv +// ///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/epilogue/collective/detail.hpp b/include/cutlass/epilogue/collective/detail.hpp index a6e5e2f4d6..6c0368e09b 100644 --- a/include/cutlass/epilogue/collective/detail.hpp +++ b/include/cutlass/epilogue/collective/detail.hpp @@ -71,6 +71,62 @@ is_im2col() { || cute::is_same_v>; } +template +struct sm90_is_ptr_array_tma : cute::false_type {}; + +template<> +struct sm90_is_ptr_array_tma : cute::true_type {}; + +template<> +struct sm90_is_ptr_array_tma : cute::true_type {}; + +template<> +struct sm90_is_ptr_array_tma : cute::true_type {}; + +template +static constexpr bool sm90_is_ptr_array_tma_v = sm90_is_ptr_array_tma::value; + +template +struct sm90_is_ptr_array_tma_cooperative : cute::false_type {}; + +template<> +struct sm90_is_ptr_array_tma_cooperative : cute::true_type {}; + +template +static constexpr bool sm90_is_ptr_array_tma_cooperative_v = sm90_is_ptr_array_tma_cooperative::value; + +template +struct 
sm90_is_ptr_array_tma_pingpong : cute::false_type {}; + +template<> +struct sm90_is_ptr_array_tma_pingpong : cute::true_type {}; + +template +static constexpr bool sm90_is_ptr_array_tma_pingpong_v = sm90_is_ptr_array_tma_pingpong::value; + +template +struct sm90_is_ptr_array_tma_dispatch_policy : cute::false_type {}; + +template< + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + int NumEpilogueWarpGroups +> +struct sm90_is_ptr_array_tma_dispatch_policy< + Sm90PtrArrayTmaWarpSpecialized> + : cute::true_type {}; + +template +static constexpr bool sm90_is_ptr_array_tma_dispatch_policy_v = sm90_is_ptr_array_tma_dispatch_policy::value; + using cutlass::atomic_maximum; template @@ -79,14 +135,11 @@ static constexpr int elements_per_access_v = cutlass::sizeof_bits::val template static constexpr bool sm90_is_cooperative_v = cute::is_base_of_v || - cute::is_base_of_v; - -template -static constexpr bool sm90_is_tma_ptr_array_v = - cute::is_base_of_v; + sm90_is_ptr_array_tma_cooperative_v; template static constexpr bool sm90_is_warp_specialized_v = + (!sm90_is_ptr_array_tma_cooperative_v && sm90_is_ptr_array_tma_v) || cute::is_base_of_v; template @@ -146,6 +199,14 @@ struct IsThreadEpilogueOpWithActivation +struct IsThreadEpilogueOpWithElementwiseArguments : cute::false_type {}; + +template +struct IsThreadEpilogueOpWithElementwiseArguments< + ThreadEpilogueOp, + cute::void_t> : cute::true_type {}; + // Wrapper class to use operator-style epilogues in sm90 TMA warp-specialized kernels template class Sm90TmaWarpSpecializedAdapter : public EpilogueOp { @@ -199,7 +260,11 @@ class Sm90TmaWarpSpecializedAdapter : public EpilogueOp { } CUTLASS_DEVICE auto - load_init([[maybe_unused]] typename EpilogueOp::Params const& params, [[maybe_unused]] int32_t const sm_count, [[maybe_unused]] int32_t const sm_idx) const { + load_init( + [[maybe_unused]] typename EpilogueOp::Params const& params, + [[maybe_unused]] TensorMapStorage& shared_tensormaps, + [[maybe_unused]] int32_t sm_count, + [[maybe_unused]] int32_t sm_idx) { return cute::make_tuple(nullptr); } @@ -243,7 +308,7 @@ class Sm90TmaWarpSpecializedAdapter : public EpilogueOp { [[maybe_unused]] TensorStorage& shared_tensors, [[maybe_unused]] TensorMapC const& load_tensormap, [[maybe_unused]] int subtile_idx=-1, - [[maybe_unused]] bool return_prior_state = false) + [[maybe_unused]] bool wait = false) { return load_pipe_producer_state; } @@ -257,8 +322,12 @@ class Sm90TmaWarpSpecializedAdapter : public EpilogueOp { } CUTLASS_DEVICE auto - store_init([[maybe_unused]] typename EpilogueOp::Params const& params, [[maybe_unused]] int32_t const sm_count, - [[maybe_unused]] int32_t const sm_idx) const { + store_init( + [[maybe_unused]] typename EpilogueOp::Params const& params, + [[maybe_unused]] TensorMapStorage& shared_tensormaps, + [[maybe_unused]] int32_t sm_count, + [[maybe_unused]] int32_t sm_idx, + [[maybe_unused]] int32_t warp_group_idx) { return cute::make_tuple(nullptr); } @@ -369,22 +438,25 @@ class Sm90TmaWarpSpecializedAdapter : public EpilogueOp { // Dummy methods to perform different parts of TMA/Tensormap modifications - template + template CUTLASS_DEVICE void tensormaps_perform_update( - [[maybe_unused]] TensorMapStorage& shared_tensormap, + [[maybe_unused]] TensorMapStorage& shared_tensormaps, [[maybe_unused]] typename EpilogueOp::Params const& params, [[maybe_unused]] cute::TmaDescriptor const* tensormap, - [[maybe_unused]] int32_t next_batch) { } + [[maybe_unused]] ProblemShapeMNKL problem_shape, + 
[[maybe_unused]] int32_t next_batch, + [[maybe_unused]] int32_t warp_group_idx) { } template CUTLASS_DEVICE void tensormaps_cp_fence_release( - [[maybe_unused]] TensorMapStorage& shared_tensormap, + [[maybe_unused]] TensorMapStorage& shared_tensormaps, [[maybe_unused]] cute::TmaDescriptor const* tensormap, - [[maybe_unused]] uint32_t lane_predicate) { } + [[maybe_unused]] int32_t warp_group_idx) { } template CUTLASS_DEVICE diff --git a/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp b/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp index 689bb7a9f0..25280ed12b 100644 --- a/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp +++ b/include/cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp @@ -46,6 +46,25 @@ namespace collective { ///////////////////////////////////////////////////////////////////////////////////////////////// +template < + class StrideC, + class StrideD, + class ThreadEpilogueOp, + class SmemLayout, + class CopyAtomR2S, + class TiledCopyS2R, + class CopyAtomR2G, + class EpilogueScheduleType = EpilogueSimtVectorized, + class Enable = void +> +class Epilogue { + static_assert(cute::is_same_v || + cute::is_same_v, + "Could not find an epilogue specialization."); +}; + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Epilogue Vectorized /// Applies an element wise operation to all elements within the fragment /// and writes it out to destination storage. /// @@ -61,9 +80,22 @@ template < class SmemLayout_, class CopyAtomR2S_, class TiledCopyS2R_, - class CopyAtomR2G_ + class CopyAtomR2G_, + class EpilogueScheduleType_ > -class Epilogue { +class Epilogue< + StrideC_, + StrideD_, + ThreadEpilogueOp_, + SmemLayout_, + CopyAtomR2S_, + TiledCopyS2R_, + CopyAtomR2G_, + EpilogueScheduleType_, + cute::enable_if_t< + cute::is_same_v + > + > { public: // // Type Aliases @@ -78,15 +110,17 @@ class Epilogue { using StrideC = StrideC_; using ElementD = typename ThreadEpilogueOp::ElementD; using StrideD = StrideD_; - + using ElementBias = typename detail::IsThreadEpilogueOpWithBias::type; using SmemLayout = SmemLayout_; using CopyAtomR2S = CopyAtomR2S_; using TiledCopyS2R = TiledCopyS2R_; using CopyAtomR2G = CopyAtomR2G_; - static const int kOutputAlignment = ThreadEpilogueOp::kCount; + using GmemTiledCopyC = void; + using GmemTiledCopyD = CopyAtomR2G; - using AlignmentType = typename cute::uint_bit::value * kOutputAlignment>::type; + static constexpr bool IsEpilogueBiasSupported = detail::IsThreadEpilogueOpWithBias::value; + using StrideBias = cute::conditional_t(), Stride<_1,_0,int64_t>, Stride<_0,_1,int64_t>>; static_assert(cute::rank(StrideC{}) == 3, "StrideCD must be rank-3: [M, N, L]"); static_assert(cute::rank(StrideD{}) == 3, "StrideCD must be rank-3: [M, N, L]"); @@ -96,9 +130,35 @@ class Epilogue { cute::array_aligned> smem_epilogue; }; + static constexpr bool IsActHasArgs = detail::IsThreadEpilogueOpWithElementwiseArguments::value; + // Host side epilogue arguments + template + struct ThreadEpilogueOpArguments { + ElementScalar alpha{0}; + ElementScalar beta{0}; + ElementScalar const* alpha_ptr = nullptr; + ElementScalar const* beta_ptr = nullptr; + ElementBias const* bias_ptr = nullptr; + StrideBias dBias{}; + }; + + template + struct ThreadEpilogueOpArguments< + ThreadEpiOp, + cute::enable_if_t::value>> { + ElementScalar alpha{0}; + ElementScalar beta{0}; + ElementScalar const* alpha_ptr = nullptr; + ElementScalar const* beta_ptr = nullptr; + ElementBias const* bias_ptr = 
nullptr; + StrideBias dBias{}; + typename ThreadEpiOp::ElementwiseArguments activation{}; + }; + struct Arguments { - typename ThreadEpilogueOp::Params thread{}; + ThreadEpilogueOpArguments thread{}; + using StrideBias = decltype(thread.dBias); ElementC const* ptr_C = nullptr; StrideC dC{}; ElementD* ptr_D = nullptr; @@ -106,7 +166,32 @@ class Epilogue { }; // Device side epilogue params - using Params = Arguments; + template + struct ParamsType { + typename ThreadEpiOp::Params thread{}; + ElementC const* ptr_C = nullptr; + StrideC dC{}; + ElementD* ptr_D = nullptr; + StrideD dD{}; + ElementBias const* ptr_Bias = nullptr; + StrideBias dBias{}; + }; + + template + struct ParamsType< + ThreadEpiOp, + cute::enable_if_t::value>> { + typename ThreadEpiOp::Params thread{}; + typename ThreadEpiOp::ElementwiseArguments activation{}; + ElementC const* ptr_C = nullptr; + StrideC dC{}; + ElementD* ptr_D = nullptr; + StrideD dD{}; + ElementBias const* ptr_Bias = nullptr; + StrideBias dBias{}; + }; + + using Params = ParamsType; // // Methods @@ -117,8 +202,36 @@ class Epilogue { to_underlying_arguments( [[maybe_unused]] ProblemShape const& _, Arguments const& args, - [[maybe_unused]] void* workspace) { - return args; + [[maybe_unused]] void* workspace) { + typename ThreadEpilogueOp::Params thread_op_args; + thread_op_args.alpha = args.thread.alpha; + thread_op_args.beta = args.thread.beta; + thread_op_args.alpha_ptr = args.thread.alpha_ptr; + thread_op_args.beta_ptr = args.thread.beta_ptr; + + if constexpr (IsActHasArgs) { + return { + thread_op_args, + args.thread.activation, + args.ptr_C, + args.dC, + args.ptr_D, + args.dD, + args.thread.bias_ptr, + args.thread.dBias + }; + } + else { + return { + thread_op_args, + args.ptr_C, + args.dC, + args.ptr_D, + args.dD, + args.thread.bias_ptr, + args.thread.dBias + }; + } } template @@ -169,8 +282,7 @@ class Epilogue { TiledMma tiled_mma, ResidueMNK residue_mnk, int thread_idx, - char* smem_buf) - { + char* smem_buf) { using namespace cute; using X = Underscore; @@ -192,88 +304,112 @@ class Epilogue { auto L = get<3>(problem_shape_mnkl); // Represent the full output tensor - Tensor mC_mnl = make_tensor(make_gmem_ptr(params.ptr_C), make_shape(M,N,L), params.dC); // (m,n,l) - Tensor mD_mnl = make_tensor(make_gmem_ptr(params.ptr_D), make_shape(M,N,L), params.dD); // (m,n,l) - Tensor gC_mnl = local_tile(mC_mnl, blk_shape_MNK, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l) - Tensor gD_mnl = local_tile(mD_mnl, blk_shape_MNK, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l) + Tensor mC_mnl = make_tensor(make_gmem_ptr(params.ptr_C), make_shape(M,N,L), params.dC); // (m,n,l) + Tensor mD_mnl = make_tensor(make_gmem_ptr(params.ptr_D), make_shape(M,N,L), params.dD); // (m,n,l) + Tensor mBias_mnl = make_tensor(make_gmem_ptr(params.ptr_Bias), make_shape(M,N,L), params.dBias); // (m,n,l) + + Tensor gC_mnl = local_tile(mC_mnl, blk_shape_MNK, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l) + Tensor gD_mnl = local_tile(mD_mnl, blk_shape_MNK, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l) + Tensor gBias_mnl = local_tile(mBias_mnl, blk_shape_MNK, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l) // Slice to get the tile this CTA is responsible for auto [m_coord, n_coord, k_coord, l_coord] = blk_coord_mnkl; Tensor gC = gC_mnl(_,_,m_coord,n_coord,l_coord); // (BLK_M,BLK_N) Tensor gD = gD_mnl(_,_,m_coord,n_coord,l_coord); // (BLK_M,BLK_N) - + Tensor gBias = gBias_mnl(_,_,m_coord,n_coord,l_coord); // (BLK_M,BLK_N) + 
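+    // Note: dBias carries a stride-0 mode (per-row or per-column broadcast, cf. StrideBias above),
+    // so gBias revisits the same bias element along the broadcast mode. The filter_zeros
+    // calls further below collapse those redundant gmem->register copies.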
// Construct a tensor in SMEM that we can partition for rearranging data SharedStorage& storage = *reinterpret_cast(smem_buf); - Tensor sC = make_tensor(make_smem_ptr(storage.smem_epilogue.data()), SmemLayout{}); // (SMEM_M,SMEM_N) + Tensor sAcc = make_tensor(make_smem_ptr(storage.smem_epilogue.data()), SmemLayout{}); // (SMEM_M,SMEM_N) - // Partition sC to match the accumulator partitioning + // Partition sAcc to match the accumulator partitioning auto tiled_r2s = make_tiled_copy_C(CopyAtomR2S{}, tiled_mma); - auto tC = tiled_r2s.get_thread_slice(thread_idx); - Tensor tCaC = tC.retile_S(accumulators); // ((Atom,AtomNum), MMA_M, MMA_N) - Tensor tCsC = tC.partition_D(sC); // ((Atom,AtomNum),PIPE_M,PIPE_N) + auto thread_r2s = tiled_r2s.get_thread_slice(thread_idx); + Tensor tRS_rAcc = thread_r2s.retile_S(accumulators); // ((Atom,AtomNum), MMA_M, MMA_N) + Tensor tRS_sAcc = thread_r2s.partition_D(sAcc); // ((Atom,AtomNum),PIPE_M,PIPE_N) // Tile gD and gC by the shape of SmemLayout first - auto tile = make_shape(size<0>(sC), size<1>(sC)); + auto tile = make_shape(size<0>(sAcc), size<1>(sAcc)); Tensor gCt = flat_divide(gC, tile); // (SMEM_M,SMEM_N,TILE_M,TILE_N) Tensor gDt = flat_divide(gD, tile); // (SMEM_M,SMEM_N,TILE_M,TILE_N) + Tensor gBiast = flat_divide(gBias, tile); // (SMEM_M,SMEM_N,TILE_M,TILE_N) - // Partition sC, gC, and gD for the output + // Partition sAcc, gC, and gD for the output auto tiled_s2r = TiledCopyS2R{}; - auto tD = tiled_s2r.get_thread_slice(thread_idx); - Tensor tDsC = tD.partition_S(sC); // ((Atom,AtomNum),ATOM_M,ATOM_N) - Tensor tDgC = tD.partition_D(gCt); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) - Tensor tDgD = tD.partition_D(gDt); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) + auto thread_s2r = tiled_s2r.get_thread_slice(thread_idx); + Tensor tSR_sAcc = thread_s2r.partition_S(sAcc); // ((Atom,AtomNum),ATOM_M,ATOM_N) + Tensor tSR_gC = thread_s2r.partition_D(gCt); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) + Tensor tSR_gD = thread_s2r.partition_D(gDt); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) + Tensor tSR_gBias = thread_s2r.partition_D(gBiast); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) // Allocate intermediate registers on the dst tensors - Tensor tDrC = make_tensor(take<0,3>(shape(tDgC))); // ((Atom,AtomNum),ATOM_M,ATOM_N) - Tensor tDrD = make_tensor(shape(tDrC)); // ((Atom,AtomNum),ATOM_M,ATOM_N) + Tensor tSR_rAcc = make_tensor(take<0,3>(shape(tSR_gC))); // ((Atom,AtomNum),ATOM_M,ATOM_N) + Tensor tSR_rC = make_tensor(shape(tSR_rAcc)); // ((Atom,AtomNum),ATOM_M,ATOM_N) + Tensor tSR_rD = make_tensor(shape(tSR_rAcc)); // ((Atom,AtomNum),ATOM_M,ATOM_N) + Tensor tSR_rBias = make_tensor_like(tSR_gBias); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) // Repeat the D-partitioning for coordinates and predication - Tensor cD = make_identity_tensor(make_shape(size<0>(gD),size<1>(gD))); // (BLK_M,BLK_N) -> (blk_m,blk_n) - Tensor cDt = flat_divide(cD, tile); // (SMEM_M,SMEM_N,TILE_M,TILE_N) - Tensor tDcD = tD.partition_D(cDt); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) + Tensor cD = make_identity_tensor(make_shape(size<0>(gD),size<1>(gD))); // (BLK_M,BLK_N) -> (blk_m,blk_n) + Tensor cDt = flat_divide(cD, tile); // (SMEM_M,SMEM_N,TILE_M,TILE_N) + Tensor tSR_cD = thread_s2r.partition_D(cDt); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) - CUTE_STATIC_ASSERT(size<1>(tCaC) % size<3>(tDgC) == 0); // TILE_M divides MMA_M - CUTE_STATIC_ASSERT(size<2>(tCaC) % size<4>(tDgC) == 0); // TILE_N divides MMA_N - CUTE_STATIC_ASSERT(typename 
TiledCopyS2R::TiledNumThr{} == size<0>(typename TiledMma::AtomLayoutC_TV{})); + CUTE_STATIC_ASSERT(size<1>(tRS_rAcc) % size<3>(tSR_gC) == 0); // TILE_M divides MMA_M + CUTE_STATIC_ASSERT(size<2>(tRS_rAcc) % size<4>(tSR_gC) == 0); // TILE_N divides MMA_N #if 0 if (thread_idx == 0 && m_coord == 0 && n_coord == 0) { print("aC : "); print(accumulators.layout()); print("\n"); print("gC : "); print(gC.layout()); print("\n"); print("gD : "); print(gD.layout()); print("\n"); - print("sC : "); print(sC.layout()); print("\n"); + print("gBias : "); print(gBias.layout()); print("\n"); + print("sAcc : "); print(sAcc.layout()); print("\n"); print("\n"); - print("tCsC : "); print(tCsC.layout()); print("\n"); - print("tCaC : "); print(tCaC.layout()); print("\n"); + print("tRS_sAcc : "); print(tRS_sAcc.layout()); print("\n"); + print("tRS_rAcc : "); print(tRS_rAcc.layout()); print("\n"); print("\n"); print("gDt : "); print(gDt.layout()); print("\n"); - print("tDsC : "); print(tDsC.layout()); print("\n"); - print("tDrC : "); print(tDrC.layout()); print("\n"); + print("tSR_sAcc : "); print(tSR_sAcc.layout()); print("\n"); + print("tSR_rAcc : "); print(tSR_rAcc.layout()); print("\n"); print("\n"); - print("tDrD : "); print(tDrD.layout()); print("\n"); - print("tDgC : "); print(tDgC.layout()); print("\n"); - print("tDgD : "); print(tDgD.layout()); print("\n"); + print("tSR_rC : "); print(tSR_rC.layout()); print("\n"); + print("tSR_rD : "); print(tSR_rD.layout()); print("\n"); + print("tSR_gC : "); print(tSR_gC.layout()); print("\n"); + print("tSR_gD : "); print(tSR_gD.layout()); print("\n"); print("\n"); + print("gBiast : "); print(gBiast.layout()); print("\n"); + print("tSR_gBias : "); print(tSR_gBias.layout()); print("\n"); + print("tSR_rBias : "); print(tSR_rBias.layout()); print("\n"); } #endif + if constexpr (IsEpilogueBiasSupported) { + if (params.ptr_Bias) { + // Filter so we don't issue redundant copies over stride-0 modes + // (only works if 0-strides are in same location, which is by construction) + Tensor tSR_gBias_flt = filter_zeros(tSR_gBias); + Tensor tSR_rBias_flt = filter_zeros(tSR_rBias); + Tensor tSR_cD_flt = filter_zeros(tSR_cD, tSR_gBias.stride()); + + // Step 0. Copy Bias from GMEM to fragment + auto pred_fn = [&] (auto const&... coords) { return elem_less(tSR_cD_flt(coords...), take<0, 2>(residue_mnk)); }; + copy_if(pred_fn, tSR_gBias_flt, tSR_rBias_flt); + } + } + // For each tiling needed for SmemLayout to cover shape(gD) CUTLASS_PRAGMA_UNROLL - for (int step_m = 0; step_m < size<2>(cDt); ++step_m) - { + for (int step_m = 0; step_m < size<2>(cDt); ++step_m) { CUTLASS_PRAGMA_UNROLL - for (int step_n = 0; step_n < size<3>(cDt); ++step_n) - { + for (int step_n = 0; step_n < size<3>(cDt); ++step_n) { // Step 1. Copy to SMEM CUTLASS_PRAGMA_UNROLL - for (int pipe_m = 0; pipe_m < size<1>(tCsC); ++pipe_m) { + for (int pipe_m = 0; pipe_m < size<1>(tRS_sAcc); ++pipe_m) { CUTLASS_PRAGMA_UNROLL - for (int pipe_n = 0; pipe_n < size<2>(tCsC); ++pipe_n) { - int mma_m = step_m * size<1>(tCsC) + pipe_m; - int mma_n = step_n * size<2>(tCsC) + pipe_n; + for (int pipe_n = 0; pipe_n < size<2>(tRS_sAcc); ++pipe_n) { + int mma_m = step_m * size<1>(tRS_sAcc) + pipe_m; + int mma_n = step_n * size<2>(tRS_sAcc) + pipe_n; - copy(tiled_r2s, tCaC(_,mma_m,mma_n), tCsC(_,pipe_m,pipe_n)); + copy(tiled_r2s, tRS_rAcc(_,mma_m,mma_n), tRS_sAcc(_,pipe_m,pipe_n)); } } @@ -281,59 +417,115 @@ class Epilogue { synchronize(); // Step 3. 
Copy from SMEM into a fragment - copy(tiled_s2r, tDsC, tDrC); + copy(tiled_s2r, tSR_sAcc, tSR_rAcc); // Step 4. Wait for SMEM reads to complete synchronize(); - Tensor tDgDmn = tDgD(_,_,_,step_m,step_n); - Tensor tDcDmn = tDcD(_,_,_,step_m,step_n); + Tensor tSR_gDmn = tSR_gD(_,_,_,step_m,step_n); + Tensor tSR_cDmn = tSR_cD(_,_,_,step_m,step_n); + + if constexpr (IsEpilogueBiasSupported) { + Tensor tSR_rBiasmn = tSR_rBias(_,_,_,step_m,step_n); + + if (epilogue_op.is_source_needed()) { + // source is needed + Tensor tSR_gCmn = tSR_gC(_,_,_,step_m,step_n); + + // Step 5. Copy C from GMEM to a fragment + CUTLASS_PRAGMA_UNROLL + for (int m = 0; m < size<1>(tSR_gDmn); ++m) { + CUTLASS_PRAGMA_UNROLL + for (int n = 0; n < size<2>(tSR_gDmn); ++n) { + // Predication + if (elem_less(tSR_cDmn(0,m,n), take<0,2>(residue_mnk))) { + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size<0>(tSR_rAcc); ++i) { + tSR_rC(i,m,n) = tSR_gCmn(i,m,n); + } + } + } + } + + // Step 6. Elementwise operation with conversion + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tSR_rAcc); ++i) { + if constexpr (IsActHasArgs) { + epilogue_op(tSR_rD(i), tSR_rD(i), tSR_rAcc(i), tSR_rC(i), tSR_rBiasmn(i), params.activation); + } else { + epilogue_op(tSR_rD(i), tSR_rD(i), tSR_rAcc(i), tSR_rC(i), tSR_rBiasmn(i)); + } + } + } + else { + // source is not needed, avoid load and lift compute + + // Step 5. Elementwise operation with conversion + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tSR_rAcc); ++i) { + if constexpr (IsActHasArgs) { + epilogue_op(tSR_rD(i), tSR_rD(i), tSR_rAcc(i), tSR_rBiasmn(i), params.activation); + } else { + epilogue_op(tSR_rD(i), tSR_rD(i), tSR_rAcc(i), tSR_rBiasmn(i)); + } + } + } - if (epilogue_op.is_source_needed()) { - // source is needed - Tensor tDgCmn = tDgC(_,_,_,step_m,step_n); CUTLASS_PRAGMA_UNROLL - for (int m = 0; m < size<1>(tDgDmn); ++m) - { + for (int m = 0; m < size<1>(tSR_gDmn); ++m) { CUTLASS_PRAGMA_UNROLL - for (int n = 0; n < size<2>(tDgDmn); ++n) - { + for (int n = 0; n < size<2>(tSR_gDmn); ++n) { // Predication - if (get<0>(tDcDmn(0,m,n)) < get<0>(residue_mnk) && - get<1>(tDcDmn(0,m,n)) < get<1>(residue_mnk)) - { - // Step 5. Elementwise operation with conversion - CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < size<0>(tDrC); ++i) { - tDrD(i,m,n) = epilogue_op(tDrC(i,m,n), tDgCmn(i,m,n)); + if (elem_less(tSR_cDmn(0,m,n), take<0,2>(residue_mnk))) { + // The Last Step. Copy to GMEM + copy(CopyAtomR2G{}, tSR_rD(_,m,n), tSR_gDmn(_,m,n)); + } + } + } + } else { + if (epilogue_op.is_source_needed()) { + // source is needed + Tensor tSR_gCmn = tSR_gC(_,_,_,step_m,step_n); + + // Step 5. Copy C from GMEM to a fragment + CUTLASS_PRAGMA_UNROLL + for (int m = 0; m < size<1>(tSR_gDmn); ++m) { + CUTLASS_PRAGMA_UNROLL + for (int n = 0; n < size<2>(tSR_gDmn); ++n) { + // Predication + if (elem_less(tSR_cDmn(0,m,n), take<0,2>(residue_mnk))) { + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size<0>(tSR_rAcc); ++i) { + tSR_rC(i,m,n) = tSR_gCmn(i,m,n); + } } - // Step 6. Copy to GMEM - copy(CopyAtomR2G{}, tDrD(_,m,n), tDgDmn(_,m,n)); } } + + // Step 6. Elementwise operation with conversion + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tSR_rAcc); ++i) { + tSR_rD(i) = epilogue_op(tSR_rAcc(i), tSR_rC(i)); + } } - } - else { - // source is not needed, avoid load and lift compute + else { + // source is not needed, avoid load and lift compute - // Step 5. Elementwise operation with conversion - CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < size(tDrC); ++i) { - tDrD(i) = epilogue_op(tDrC(i)); + // Step 5. 
Elementwise operation with conversion + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tSR_rAcc); ++i) { + tSR_rD(i) = epilogue_op(tSR_rAcc(i)); + } } CUTLASS_PRAGMA_UNROLL - for (int m = 0; m < size<1>(tDgDmn); ++m) - { + for (int m = 0; m < size<1>(tSR_gDmn); ++m) { CUTLASS_PRAGMA_UNROLL - for (int n = 0; n < size<2>(tDgDmn); ++n) - { + for (int n = 0; n < size<2>(tSR_gDmn); ++n) { // Predication - if (get<0>(tDcDmn(0,m,n)) < get<0>(residue_mnk) && - get<1>(tDcDmn(0,m,n)) < get<1>(residue_mnk)) - { - // Step 6. Copy to GMEM - copy(CopyAtomR2G{}, tDrD(_,m,n), tDgDmn(_,m,n)); + if (elem_less(tSR_cDmn(0,m,n), take<0,2>(residue_mnk))) { + // The Last Step. Copy to GMEM + copy(CopyAtomR2G{}, tSR_rD(_,m,n), tSR_gDmn(_,m,n)); } } } diff --git a/include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp b/include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp new file mode 100644 index 0000000000..5583f96328 --- /dev/null +++ b/include/cutlass/epilogue/collective/sm70_epilogue_vectorized_array.hpp @@ -0,0 +1,412 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +/*! \file + \brief Functor performing elementwise operations used by epilogues. +*/ + +#pragma once + +#include "cutlass/epilogue/collective/sm70_epilogue_vectorized.hpp" + +///////////////////////////////////////////////////////////////////////////////////////////////// + +namespace cutlass { +namespace epilogue { +namespace collective { + +///////////////////////////////////////////////////////////////////////////////////////////////// +/// Ptr Array Epilogue Vectorized +/// Applies an element wise operation to all elements within the fragment +/// and writes it out to destination storage. 
+///
+/// Ways to generalize this:
+/// - CTA tile shape
+/// - vectorization requirements (GMEM)
+/// - vectoriz(able) transform()
+///
+template <
+  class StrideC_,
+  class StrideD_,
+  class ThreadEpilogueOp_,
+  class SmemLayout_,
+  class CopyAtomR2S_,
+  class TiledCopyS2R_,
+  class CopyAtomR2G_,
+  class EpilogueScheduleType_
+>
+class Epilogue<
+    StrideC_,
+    StrideD_,
+    ThreadEpilogueOp_,
+    SmemLayout_,
+    CopyAtomR2S_,
+    TiledCopyS2R_,
+    CopyAtomR2G_,
+    EpilogueScheduleType_,
+    cute::enable_if_t<
+      cute::is_same_v<EpilogueScheduleType_, EpiloguePtrArraySimtVectorized>
+    >
+  > {
+public:
+  //
+  // Type Aliases
+  //
+  // derived types of output thread level operator
+  using ThreadEpilogueOp = ThreadEpilogueOp_;
+  using ElementAccumulator = typename ThreadEpilogueOp::ElementAccumulator;
+  using ElementCompute = typename ThreadEpilogueOp::ElementCompute;
+  using ElementScalar = ElementCompute;
+  using ElementOutput = typename ThreadEpilogueOp::ElementOutput;
+  using ElementC = typename ThreadEpilogueOp::ElementC;
+  using StrideC = StrideC_;
+  using InternalStrideC = cute::remove_pointer_t<StrideC>;
+  using ElementD = typename ThreadEpilogueOp::ElementD;
+  using StrideD = StrideD_;
+  using InternalStrideD = cute::remove_pointer_t<StrideD>;
+
+  using SmemLayout = SmemLayout_;
+  using CopyAtomR2S = CopyAtomR2S_;
+  using TiledCopyS2R = TiledCopyS2R_;
+  using CopyAtomR2G = CopyAtomR2G_;
+
+  using GmemTiledCopyC = TiledCopyS2R;
+  using GmemTiledCopyD = TiledCopyS2R;
+
+  static const int kOutputAlignment = ThreadEpilogueOp::kCount;
+
+  using AlignmentType = typename cute::uint_bit<sizeof_bits<ElementOutput>::value * kOutputAlignment>::type;
+
+  static_assert(cute::rank(InternalStrideC{}) == 3, "StrideCD must be rank-3: [M, N, L]");
+  static_assert(cute::rank(InternalStrideD{}) == 3, "StrideCD must be rank-3: [M, N, L]");
+
+  struct SharedStorage
+  {
+    cute::array_aligned<ElementAccumulator, cute::cosize_v<SmemLayout>> smem_epilogue;
+  };
+
+  using TensorMapStorage = SharedStorage;
+
+  // Host side epilogue arguments
+  struct Arguments {
+    typename ThreadEpilogueOp::Params thread{};
+    ElementC const** ptr_C = nullptr;
+    StrideC dC{};
+    ElementD** ptr_D = nullptr;
+    StrideD dD{};
+  };
+
+  // Device side epilogue params
+  using Params = Arguments;
+
+  //
+  // Methods
+  //
+
+  template <class ProblemShape>
+  static constexpr Params
+  to_underlying_arguments(
+      ProblemShape const&,
+      Arguments const& args,
+      [[maybe_unused]] void* workspace) {
+    return args;
+  }
+
+  template <class ProblemShape>
+  static size_t
+  get_workspace_size(ProblemShape const& problem_shape, Arguments const& args, int sm_count) {
+    return 0;
+  }
+
+  template <class ProblemShape>
+  static cutlass::Status
+  initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream,
+    CudaHostAdapter* cuda_adapter = nullptr) {
+    return cutlass::Status::kSuccess;
+  }
+
+  template <class ProblemShape>
+  static bool
+  can_implement(
+      [[maybe_unused]] ProblemShape const& problem_shape,
+      [[maybe_unused]] Arguments const& args) {
+    return true;
+  }
+
+  CUTLASS_HOST_DEVICE
+  Epilogue(Params const& params_)
+      : params(params_) { }
+
+  CUTLASS_DEVICE
+  bool
+  is_source_needed() {
+    // For Ptr-Array or Grouped Gemm we cannot determine if source is needed based on first beta.
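+    // Conservatively report that the source may be needed; each group's own
+    // alpha/beta are resolved later, when ThreadEpilogueOp is constructed
+    // with the group index in operator().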
+    return true;
+  }
+
+  template<
+    class ProblemShapeMNKL,
+    class BlockShapeMNK,
+    class BlockCoordMNKL,
+    class FrgEngine, class FrgLayout,
+    class TiledMma,
+    class ResidueMNK
+  >
+  CUTLASS_DEVICE void
+  operator()(
+      ProblemShapeMNKL problem_shape_mnkl,
+      BlockShapeMNK blk_shape_MNK,
+      BlockCoordMNKL blk_coord_mnkl,
+      cute::Tensor<FrgEngine, FrgLayout> const& accumulators,                  // (MMA,MMA_M,MMA_N)
+      TiledMma tiled_mma,
+      ResidueMNK residue_mnk,
+      int thread_idx,
+      char* smem_buf) {
+    using namespace cute;
+    using X = Underscore;
+
+    static_assert(cute::rank(ProblemShapeMNKL{}) == 4, "ProblemShapeMNKL must be rank 4");
+    static_assert(is_static<BlockShapeMNK>::value, "ThreadBlock tile shape must be static");
+    static_assert(cute::rank(BlockShapeMNK{}) == 3, "BlockShapeMNK must be rank 3");
+    static_assert(cute::rank(BlockCoordMNKL{}) == 4, "BlockCoordMNKL must be rank 4");
+
+    // synchronizing function for smem reads/writes
+#if CUDA_BARRIER_ENABLED
+    auto synchronize = [] () { cutlass::arch::NamedBarrier::sync(typename TiledCopyS2R::TiledNumThr{}, cutlass::arch::ReservedNamedBarriers::EpilogueBarrier); };
+#else
+    auto synchronize = [] () { __syncthreads(); };
+#endif
+
+    // Separate out problem shape for convenience
+    auto M = get<0>(problem_shape_mnkl);
+    auto N = get<1>(problem_shape_mnkl);
+    auto L = get<3>(problem_shape_mnkl);
+    // Batches are managed by using appropriate pointers to C and D matrices
+    const int32_t mock_L = 1;
+    const int32_t mock_l_coord = 0;
+    // Slice to get the tile this CTA is responsible for
+    auto [m_coord, n_coord, k_coord, l_coord] = blk_coord_mnkl;
+
+    // If scalar alpha/beta are provided, the same alpha/beta applies to all batches/groups.
+    // If pointers to alpha/beta are provided, alpha/beta can differ between batches/groups,
+    // and the correct values for the current batch/group are selected by its group index.
+ ThreadEpilogueOp epilogue_op = ThreadEpilogueOp(params.thread, l_coord); + + if (epilogue_op.is_source_needed() && params.dC == nullptr) { + // Beta value is non-zero while pointer to C is a nullptr + assert(0); + } + + InternalStrideC stride_c; + InternalStrideD stride_d; + if constexpr (!cute::is_same_v) { + // If grouped gemm + if (epilogue_op.is_source_needed()) { + stride_c = params.dC[l_coord]; + } + stride_d = params.dD[l_coord]; + } + else { + stride_c = params.dC; + stride_d = params.dD; + } + + // Represent the full output tensor + ElementC const* ptr_C_l = nullptr; + if (epilogue_op.is_source_needed()) { + ptr_C_l = params.ptr_C[l_coord]; + } + Tensor mC_mnl = make_tensor(make_gmem_ptr(ptr_C_l), make_shape(M,N,mock_L), stride_c); // (m,n,l) + Tensor mD_mnl = make_tensor(make_gmem_ptr(params.ptr_D[l_coord]), make_shape(M,N,mock_L), stride_d); // (m,n,l) + Tensor gC_mnl = local_tile(mC_mnl, blk_shape_MNK, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l) + Tensor gD_mnl = local_tile(mD_mnl, blk_shape_MNK, make_coord(_,_,_), Step<_1,_1, X>{}); // (BLK_M,BLK_N,m,n,l) + + Tensor gC = gC_mnl(_,_,m_coord,n_coord,mock_l_coord); // (BLK_M,BLK_N) + Tensor gD = gD_mnl(_,_,m_coord,n_coord,mock_l_coord); // (BLK_M,BLK_N) + + // Construct a tensor in SMEM that we can partition for rearranging data + SharedStorage& storage = *reinterpret_cast(smem_buf); + Tensor sAcc = make_tensor(make_smem_ptr(storage.smem_epilogue.data()), SmemLayout{}); // (SMEM_M,SMEM_N) + + // Partition sAcc to match the accumulator partitioning + auto tiled_r2s = make_tiled_copy_C(CopyAtomR2S{}, tiled_mma); + auto thread_r2s = tiled_r2s.get_thread_slice(thread_idx); + Tensor tRS_rAcc = thread_r2s.retile_S(accumulators); // ((Atom,AtomNum), MMA_M, MMA_N) + Tensor tRS_sAcc = thread_r2s.partition_D(sAcc); // ((Atom,AtomNum),PIPE_M,PIPE_N) + + // Tile gD and gC by the shape of SmemLayout first + auto tile = make_shape(size<0>(sAcc), size<1>(sAcc)); + Tensor gCt = flat_divide(gC, tile); // (SMEM_M,SMEM_N,TILE_M,TILE_N) + Tensor gDt = flat_divide(gD, tile); // (SMEM_M,SMEM_N,TILE_M,TILE_N) + + // Partition sAcc, gC, and gD for the output + auto tiled_s2r = TiledCopyS2R{}; + auto thread_s2r = tiled_s2r.get_thread_slice(thread_idx); + Tensor tSR_sAcc = thread_s2r.partition_S(sAcc); // ((Atom,AtomNum),ATOM_M,ATOM_N) + Tensor tSR_gC = thread_s2r.partition_D(gCt); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) + Tensor tSR_gD = thread_s2r.partition_D(gDt); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) + + // Allocate intermediate registers on the dst tensors + Tensor tSR_rAcc = make_tensor(take<0,3>(shape(tSR_gC))); // ((Atom,AtomNum),ATOM_M,ATOM_N) + Tensor tSR_rD = make_tensor(shape(tSR_rAcc)); // ((Atom,AtomNum),ATOM_M,ATOM_N) + + // Repeat the D-partitioning for coordinates and predication + Tensor cD = make_identity_tensor(make_shape(size<0>(gD),size<1>(gD))); // (BLK_M,BLK_N) -> (blk_m,blk_n) + Tensor cDt = flat_divide(cD, tile); // (SMEM_M,SMEM_N,TILE_M,TILE_N) + Tensor tSR_cD = thread_s2r.partition_D(cDt); // ((Atom,AtomNum),ATOM_M,ATOM_N,TILE_M,TILE_N) + + CUTE_STATIC_ASSERT(size<1>(tRS_rAcc) % size<3>(tSR_gC) == 0); // TILE_M divides MMA_M + CUTE_STATIC_ASSERT(size<2>(tRS_rAcc) % size<4>(tSR_gC) == 0); // TILE_N divides MMA_N + +#if 0 + if (thread_idx == 0 && m_coord == 0 && n_coord == 0) { + print("aC : "); print(accumulators.layout()); print("\n"); + print("gC : "); print(gC.layout()); print("\n"); + print("gD : "); print(gD.layout()); print("\n"); + print("sAcc : "); print(sAcc.layout()); print("\n"); + 
print("\n"); + print("tRS_sAcc : "); print(tRS_sAcc.layout()); print("\n"); + print("tRS_rAcc : "); print(tRS_rAcc.layout()); print("\n"); + print("\n"); + print("gDt : "); print(gDt.layout()); print("\n"); + print("tSR_sAcc : "); print(tSR_sAcc.layout()); print("\n"); + print("tSR_rAcc : "); print(tSR_rAcc.layout()); print("\n"); + print("\n"); + print("tSR_rD : "); print(tSR_rD.layout()); print("\n"); + print("tSR_gC : "); print(tSR_gC.layout()); print("\n"); + print("tSR_gD : "); print(tSR_gD.layout()); print("\n"); + print("\n"); + } +#endif + + // For each tiling needed for SmemLayout to cover shape(gD) + CUTLASS_PRAGMA_UNROLL + for (int step_m = 0; step_m < size<2>(cDt); ++step_m) { + CUTLASS_PRAGMA_UNROLL + for (int step_n = 0; step_n < size<3>(cDt); ++step_n) { + // Step 1. Copy to SMEM + CUTLASS_PRAGMA_UNROLL + for (int pipe_m = 0; pipe_m < size<1>(tRS_sAcc); ++pipe_m) { + CUTLASS_PRAGMA_UNROLL + for (int pipe_n = 0; pipe_n < size<2>(tRS_sAcc); ++pipe_n) { + int mma_m = step_m * size<1>(tRS_sAcc) + pipe_m; + int mma_n = step_n * size<2>(tRS_sAcc) + pipe_n; + + copy(tiled_r2s, tRS_rAcc(_,mma_m,mma_n), tRS_sAcc(_,pipe_m,pipe_n)); + } + } + + // Step 2. Wait for SMEM writes to complete + synchronize(); + + // Step 3. Copy from SMEM into a fragment + copy(tiled_s2r, tSR_sAcc, tSR_rAcc); + + // Step 4. Wait for SMEM reads to complete + synchronize(); + + Tensor tSR_gDmn = tSR_gD(_,_,_,step_m,step_n); + Tensor tSR_cDmn = tSR_cD(_,_,_,step_m,step_n); + + if (epilogue_op.is_source_needed()) { + // source is needed + Tensor tSR_gCmn = tSR_gC(_,_,_,step_m,step_n); + + Tensor tSR_rCmn = make_tensor(shape(tSR_gCmn)); // ((Atom,AtomNum),ATOM_M,ATOM_N) + + // Step 5. Copy C from GMEM to a fragment + CUTLASS_PRAGMA_UNROLL + for (int m = 0; m < size<1>(tSR_gDmn); ++m) { + CUTLASS_PRAGMA_UNROLL + for (int n = 0; n < size<2>(tSR_gDmn); ++n) { + // Predication + if (elem_less(tSR_cDmn(0,m,n), take<0,2>(residue_mnk))) { + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size<0>(tSR_rAcc); ++i) { + tSR_rCmn(i,m,n) = tSR_gCmn(i,m,n); + } + } + } + } + + CUTLASS_PRAGMA_UNROLL + for (int m = 0; m < size<1>(tSR_gDmn); ++m) { + CUTLASS_PRAGMA_UNROLL + for (int n = 0; n < size<2>(tSR_gDmn); ++n) { + // Predication + if (elem_less(tSR_cDmn(0,m,n), take<0,2>(residue_mnk))) { + // Step 6. Elementwise operation with conversion + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size<0>(tSR_rAcc); ++i) { + tSR_rD(i,m,n) = epilogue_op(tSR_rAcc(i,m,n), tSR_rCmn(i,m,n)); + } + // Step 7. Copy to GMEM + copy(CopyAtomR2G{}, tSR_rD(_,m,n), tSR_gDmn(_,m,n)); + } + } + } + } + else { + // source is not needed, avoid load and lift compute + + // Step 5. Elementwise operation with conversion + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tSR_rAcc); ++i) { + tSR_rD(i) = epilogue_op(tSR_rAcc(i)); + } + + CUTLASS_PRAGMA_UNROLL + for (int m = 0; m < size<1>(tSR_gDmn); ++m) { + CUTLASS_PRAGMA_UNROLL + for (int n = 0; n < size<2>(tSR_gDmn); ++n) { + // Predication + if (elem_less(tSR_cDmn(0,m,n), take<0,2>(residue_mnk))) { + // Step 6. 
Copy to GMEM + copy(CopyAtomR2G{}, tSR_rD(_,m,n), tSR_gDmn(_,m,n)); + } + } + } + } + } + } + } + +private: + Params params; +}; + + +///////////////////////////////////////////////////////////////////////////////////////////////// + +} // namespace collective +} // namespace epilogue +} // namespace cutlass + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp b/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp index 87e628879c..ae095cf915 100644 --- a/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/collective/sm90_epilogue_array_tma_warpspecialized.hpp @@ -44,9 +44,10 @@ #include "cutlass/detail/collective.hpp" #include "cutlass/detail/layout.hpp" #include "cutlass/trace.h" +#include "cutlass/cuda_host_adapter.hpp" #include "cute/tensor.hpp" -#include "cutlass/cuda_host_adapter.hpp" +#include "cute/atom/copy_traits_sm90_tma.hpp" ///////////////////////////////////////////////////////////////////////////////////////////////// @@ -62,6 +63,7 @@ template < int FragmentSize_, bool ReuseSmemC_, bool DelayTmaStore_, + int NumEpilogueWarpGroups_, class CtaTileMNK_, // (CTA_M,CTA_N,CTA_K) class EpilogueTile_, // (EPI_TILE_M,EPI_TILE_N) class ElementC_, @@ -75,10 +77,17 @@ template < class CopyOpS2G_, class SmemLayoutAtomD_, class CopyOpR2S_, - class CopyAtomC_ + class CopyAtomC_, + class CopyOpR2R_ > class CollectiveEpilogue< - Sm90PtrArrayTmaWarpSpecialized, + Sm90PtrArrayTmaWarpSpecialized, CtaTileMNK_, EpilogueTile_, ElementC_, @@ -92,13 +101,20 @@ class CollectiveEpilogue< CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_, - CopyAtomC_ + CopyAtomC_, + CopyOpR2R_ > { public: // // Type Aliases // - using DispatchPolicy = Sm90PtrArrayTmaWarpSpecialized; + using DispatchPolicy = Sm90PtrArrayTmaWarpSpecialized; using CtaTileMNK = CtaTileMNK_; using EpilogueTile = EpilogueTile_; using FusionCallbacks = FusionCallbacks_; @@ -115,7 +131,7 @@ class CollectiveEpilogue< using SmemLayoutAtomD = SmemLayoutAtomD_; using CopyOpR2S = CopyOpR2S_; using CopyAtomC = CopyAtomC_; - + using CopyOpR2R = CopyOpR2R_; using ThreadEpilogueOp = typename epilogue::fusion::FusionCallbacksTraits::Operation; using GmemTiledCopyC = CopyOpG2S; @@ -150,6 +166,9 @@ class CollectiveEpilogue< constexpr static bool is_im2col_C = cute::is_same_v; constexpr static bool is_im2col_D = cute::is_same_v; + // Check if register transformation is needed before copying register to shared memory. 
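+  // A non-void CopyOpR2R requests an extra register-to-register shuffle so
+  // that accumulator fragments match the layout expected by the R2S store.
+  // A minimal sketch of how a build might opt in (the atom shown here is a
+  // hypothetical choice, not mandated by this patch):
+  //
+  //   using CopyOpR2R = cute::AutoVectorizingCopyWithAssumedAlignment<128>;
+  //   static_assert(!cute::is_void_v<CopyOpR2R>);  // IsUseR2R becomes true below
+  //
+  // Passing void for CopyOpR2R disables the pass entirely.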
+ constexpr static bool IsUseR2R = !cute::is_void_v; + using SmemLayoutC = decltype(tile_to_shape( SmemLayoutAtomC{}, make_shape(size<0>(EpilogueTile{}), size<1>(EpilogueTile{}), Int{}), @@ -201,6 +220,8 @@ class CollectiveEpilogue< (size(take<0,2>(SmemLayoutC{})) * static_cast(sizeof_bits::value)) / 8; constexpr static bool RequiresTransactionBytes = true; + constexpr static int NumEpilogueWarpGroups = NumEpilogueWarpGroups_; + // TMA pipeline for storing D using StorePipeline = cute::conditional_t, @@ -217,9 +238,9 @@ class CollectiveEpilogue< FusionStorage thread; } tensors; - struct TensorMapStorage : cute::aligned_struct<128> { + struct TensorMapStorage : cute::aligned_struct<128, _0> { cute::TmaDescriptor smem_tensormap_C; - cute::TmaDescriptor smem_tensormap_D; + cute::array smem_tensormap_D; } tensormaps; using PipelineStorage = typename LoadPipeline::SharedStorage; @@ -229,6 +250,8 @@ class CollectiveEpilogue< using TensorMapStorage = typename SharedStorage::TensorMapStorage; using PipelineStorage = typename SharedStorage::PipelineStorage; + static constexpr bool IsGroupedGemmKernel = !cute::is_same_v; + // Host side epilogue arguments struct Arguments { typename FusionCallbacks::Arguments thread{}; @@ -247,7 +270,7 @@ class CollectiveEpilogue< take<0,2>(SmemLayoutC{}), EpilogueTile{}, _1{})); - + using TMA_D = decltype(make_tma_copy( CopyOpS2G{}, make_tensor(make_gmem_ptr(static_cast(nullptr)), @@ -261,7 +284,9 @@ class CollectiveEpilogue< TMA_D tma_store_d; cute::TmaDescriptor* tensormaps; ElementC const** ptr_C; + StrideC dC; ElementD** ptr_D; + StrideD dD; uint32_t tma_transaction_bytes = TmaTransactionBytes; }; @@ -275,36 +300,56 @@ class CollectiveEpilogue< ProblemShape const& problem_shape, Arguments const& args, [[maybe_unused]] void* workspace) { - // Optionally append 1s until problem shape is rank-4 in case its is only rank-3 (MNK) - auto problem_shape_MNKL = append<4>(problem_shape.get_host_problem_shape(), 1); - auto [M, N, K, mock_L] = problem_shape_MNKL; - // Manage batches/groups through pointers to input matricies - mock_L = 1; + // These tensor shapes (only applicable for grouped gemm) and pointers are only used to create tensormap/tma desc. + // These will be replaced with correct values before the initial tma load. + auto init_shape = repeat_like(append<4>(typename ProblemShape::UnderlyingProblemShape{}, 1), int32_t(1)); + auto init_M = get<0>(init_shape); + auto init_N = get<1>(init_shape); + auto init_L = get<3>(init_shape); static_assert(!is_im2col_C and !is_im2col_D, "Im2Col not supported on C or D"); + InternalStrideC stride_c; + InternalStrideD stride_d; + if constexpr (IsGroupedGemmKernel) { + // Strides for Grouped Gemm will be replaced prior to the first access regardless. + stride_c = InternalStrideC{}; + stride_d = InternalStrideD{}; + } + else { + // Tensor shapes for Ptr-Array are initialized correctly only here. 
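+        // (In the grouped-GEMM branch above, the 1x1x1x1 placeholder shape is
+        // kept instead, and the real per-group extents and strides are patched
+        // into the TMA descriptors on device; see
+        // tensormaps_replace_global_tensor_properties later in this file.)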
+ auto problem_shape_MNKL = append<4>(problem_shape.get_host_problem_shape(0), 1); + init_M = get<0>(problem_shape_MNKL); + init_N = get<1>(problem_shape_MNKL); + init_L = get<3>(problem_shape_MNKL); + + stride_c = args.dC; + stride_d = args.dD; + } + uint32_t transaction_bytes = TmaTransactionBytes; typename Params::TMA_C tma_load_c = {}; if constexpr (is_source_supported) { ElementC const* ptr_C_first_batch = reinterpret_cast(args.ptr_C); - Tensor tensor_c = make_tensor(ptr_C_first_batch, make_layout(make_shape(M,N,mock_L), append<3>(args.dC, _0{}))); - tma_load_c = make_tma_copy_C_sm90( + Tensor tensor_c = make_tensor(ptr_C_first_batch, make_layout(make_shape(init_M,init_N,init_L), append<3>(stride_c, _0{}))); + tma_load_c = make_tma_copy( CopyOpG2S{}, tensor_c, take<0,2>(SmemLayoutC{}), - EpilogueTile{}); - + EpilogueTile{}, + _1{}); } typename Params::TMA_D tma_store_d; if constexpr (is_destination_supported) { ElementD const* ptr_D_first_batch = reinterpret_cast(args.ptr_D); - Tensor tensor_d = make_tensor(ptr_D_first_batch, make_layout(make_shape(M,N,mock_L), append<3>(args.dD, _0{}))); - tma_store_d = make_tma_copy_C_sm90( + Tensor tensor_d = make_tensor(ptr_D_first_batch, make_layout(make_shape(init_M,init_N,init_L), append<3>(stride_d, _0{}))); + tma_store_d = make_tma_copy( CopyOpS2G{}, tensor_d, take<0,2>(SmemLayoutD{}), - EpilogueTile{}); + EpilogueTile{}, + _1{}); } auto fusion_workspace = static_cast(workspace); @@ -318,7 +363,9 @@ class CollectiveEpilogue< tma_store_d, tma_descriptor_workspace, args.ptr_C, + args.dC, args.ptr_D, + args.dD, transaction_bytes, }; } @@ -326,15 +373,18 @@ class CollectiveEpilogue< template static size_t get_workspace_size(ProblemShape const& problem_shape, Arguments const& args, int sm_count) { - constexpr uint32_t NumInputTensors = cute::is_void_v ? 1 : 2; + + constexpr uint32_t NumInputTensors = NumEpilogueWarpGroups + (cute::is_void_v ? 
0 : 1);
+
+    auto descriptors_shape = cute::make_shape(sm_count, Int<NumInputTensors>{});
     constexpr size_t SizeOfCuTensorMap = sizeof(cute::TmaDescriptor);
+    // Allocate gmem space for the tensormaps: one set per SM, holding the D
+    // tensormap copies (one per epilogue warp group) followed by the C tensormap copy
-    return (NumInputTensors * SizeOfCuTensorMap * sm_count) + FusionCallbacks::get_workspace_size(problem_shape, args.thread);
+    return (size(descriptors_shape) * SizeOfCuTensorMap) + FusionCallbacks::get_workspace_size(problem_shape, args.thread);
   }

   template <class ProblemShape>
   static cutlass::Status
-  initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream,
+  initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream,
       CudaHostAdapter* cuda_adapter = nullptr) {
     return FusionCallbacks::initialize_workspace(problem_shape, args.thread, workspace, stream, cuda_adapter);
   }

   template <class ProblemShape>
   static bool
   can_implement(
-      ProblemShape const& problem_shape,
+      ProblemShape problem_shape,
       [[maybe_unused]] Arguments const& args) {
-    auto problem_shape_MNKL = append<4>(problem_shape.get_host_problem_shape(), 1);
-    auto [M,N,K,L] = problem_shape_MNKL;
     bool implementable = true;
-    if constexpr (is_destination_supported) {
-      constexpr int tma_alignment_bits_D = cutlass::detail::get_output_alignment_bits<ElementD>();
-      constexpr int min_tma_aligned_elements_D = tma_alignment_bits_D / cutlass::sizeof_bits<ElementD>::value;
-      implementable = cutlass::detail::check_alignment<min_tma_aligned_elements_D>(cute::make_shape(M,N,L), InternalStrideD{});
-    }
+    bool fusion_implementable = true;
+
+    if (problem_shape.is_host_problem_shape_available()) {
+      for (int i = 0; i < problem_shape.groups(); ++i) {
+        auto problem_shape_MNKL = append<4>(problem_shape.get_host_problem_shape(i), 1);
+        auto [M,N,K,L] = problem_shape_MNKL;

-    if constexpr (not cute::is_void_v<ElementC>) {
-      constexpr int tma_alignment_bits_C = cutlass::detail::get_input_alignment_bits<ElementC>();
-      constexpr int min_tma_aligned_elements_C = tma_alignment_bits_C / cutlass::sizeof_bits<ElementC>::value;
-      implementable = implementable && cutlass::detail::check_alignment<min_tma_aligned_elements_C>(cute::make_shape(M,N,L), InternalStrideC{});
+        if constexpr (is_destination_supported) {
+          constexpr int tma_alignment_bits_D = cutlass::detail::get_output_alignment_bits<ElementD>();
+          constexpr int min_tma_aligned_elements_D = tma_alignment_bits_D / cutlass::sizeof_bits<ElementD>::value;
+          implementable = implementable && cutlass::detail::check_alignment<min_tma_aligned_elements_D>(cute::make_shape(M,N,L), InternalStrideD{});
+        }
+
+        if constexpr (not cute::is_void_v<ElementC>) {
+          constexpr int tma_alignment_bits_C = cutlass::detail::get_input_alignment_bits<ElementC>();
+          constexpr int min_tma_aligned_elements_C = tma_alignment_bits_C / cutlass::sizeof_bits<ElementC>::value;
+          implementable = implementable && cutlass::detail::check_alignment<min_tma_aligned_elements_C>(cute::make_shape(M,N,L), InternalStrideC{});
+        }
+
+        fusion_implementable = fusion_implementable && FusionCallbacks::can_implement(problem_shape_MNKL, args.thread);
+      }
+    }
+    else {
+      CUTLASS_TRACE_HOST("  CAN IMPLEMENT: Ignoring check to can implement because host problem shape is not available.\n");
     }

     if (!implementable) {
       CUTLASS_TRACE_HOST("  CAN IMPLEMENT: Problem Size doesn't meet the minimum alignment requirements for TMA.\n");
     }

-    bool fusion_implementable = FusionCallbacks::can_implement(problem_shape, args.thread);
-
     if (!fusion_implementable) {
       CUTLASS_TRACE_HOST("  CAN IMPLEMENT: Problem Size doesn't meet the minimum requirements for FusionCallbacks.\n");
     }
@@ -414,10 +474,14 @@
   }
CUTLASS_DEVICE auto - load_init(Params const& params, int32_t const sm_count, int32_t const sm_idx) const { + load_init( + Params const& params, + TensorMapStorage& shared_tensormaps, + int32_t sm_count, + int32_t sm_idx) { // Initialize tma for loading constexpr bool IsLoad = true; - auto load_tensormaps = tensormaps_init(params, sm_count, sm_idx); + auto load_tensormaps = tensormaps_init(params, shared_tensormaps, sm_count, sm_idx, 0); return load_tensormaps; } @@ -426,7 +490,8 @@ class CollectiveEpilogue< class TileShapeMNK, class TileCoordMNKL, class TiledMma, - class TensorMapC + class TensorMapC, + __CUTE_REQUIRES(std::is_pointer_v) > CUTLASS_DEVICE auto load( @@ -440,7 +505,7 @@ class CollectiveEpilogue< TensorStorage& shared_tensors, TensorMapC const& load_tensormap, int subtile_idx=-1, - bool return_prior_state = false) { + bool wait_until_load_finishes = false) { using namespace cute; // Indexing variables @@ -448,9 +513,9 @@ class CollectiveEpilogue< auto [m_coord, n_coord, k_coord, l_coord] = tile_coord_mnkl; static_assert(!is_im2col_D, "Do not support im2col"); - + auto coord_shape = append<3>(make_shape(m_coord, n_coord), Int<0>{}); - + // Represent the full source tensor, slice to get the tile this CTA is currently responsible for Tensor mC_mn = params.tma_load_c.get_tma_tensor(append<3>(make_shape(M,N), Int<1>{})); // (M,N,L) Tensor mC = coalesce(mC_mn, take<0,2>(CtaTileMNK{})); @@ -478,17 +543,17 @@ class CollectiveEpilogue< auto pld_callbacks = fusion_callbacks.get_producer_load_callbacks(pld_args); bool is_C_load_needed = is_source_supported && fusion_callbacks.is_C_load_needed(); + LoadPipelineState last_load_producer_state = load_pipe_producer_state; + // Predication for TMA load (one thread issues TMA load) bool issue_tma_load = cute::elect_one_sync(); - // Acquire the lock for the first stage - uint64_t* tma_barrier = load_pipeline.producer_get_barrier(load_pipe_producer_state); - load_pipeline.producer_acquire(load_pipe_producer_state); - // Pre-loop fusion callback entry point - pld_callbacks.begin(tma_barrier, load_pipe_producer_state.count(), issue_tma_load); + pld_callbacks.begin(); + + LoadPipelineState prior_state = load_pipe_producer_state; - auto prior_state = load_pipe_producer_state; + bool did_load = false; CUTLASS_PRAGMA_UNROLL for (int epi_n = 0; epi_n < size<3>(gC_epi); ++epi_n) { @@ -497,24 +562,29 @@ class CollectiveEpilogue< if (subtile_idx != -1 && (epi_n * static_cast(size<2>(gC_epi)) + epi_m) != subtile_idx) { continue; } + // Acquire the lock for this stage constexpr uint16_t mcast_mask = 0; uint64_t* tma_barrier = load_pipeline.producer_get_barrier(load_pipe_producer_state); + load_pipeline.producer_acquire(load_pipe_producer_state); // Loop fusion callback entry point pld_callbacks.step(tma_barrier, epi_m, epi_n, load_pipe_producer_state.count(), issue_tma_load); // Execute the TMA load for C if needed - if (issue_tma_load && is_C_load_needed) { - copy(params.tma_load_c.with(load_tensormap, *tma_barrier, mcast_mask), - bGS_gC(_,_,_,epi_m,epi_n), bGS_sC(_,_,_,load_pipe_producer_state.index())); - load_pipeline.producer_expect_transaction(load_pipe_producer_state); + if (is_C_load_needed) { + if (issue_tma_load) { + copy(params.tma_load_c.with(load_tensormap, *tma_barrier, mcast_mask), + bGS_gC(_,_,_,epi_m,epi_n), bGS_sC(_,_,_,load_pipe_producer_state.index())); + load_pipeline.producer_expect_transaction(load_pipe_producer_state); + } + last_load_producer_state = load_pipe_producer_state; + did_load = true; } // Commit TMA loads for this stage 
and release the lock load_pipeline.producer_commit(load_pipe_producer_state); - prior_state = load_pipe_producer_state; ++load_pipe_producer_state; } } @@ -522,17 +592,24 @@ class CollectiveEpilogue< // Post-loop fusion callback entry point pld_callbacks.end(); - if (not return_prior_state) { - return load_pipe_producer_state; - } else { - return prior_state; + if (wait_until_load_finishes && did_load) { + typename CollectiveEpilogue::LoadPipelineState epi_load_pipe_tma_consumer_state = + {last_load_producer_state.index(), !last_load_producer_state.phase(), last_load_producer_state.count()}; + load_pipeline.consumer_wait(epi_load_pipe_tma_consumer_state); } + + return load_pipe_producer_state; } CUTLASS_DEVICE auto load_tail( LoadPipeline load_pipeline, LoadPipelineState load_pipe_producer_state) { + + if (!fusion_callbacks.is_producer_load_needed()) { + return load_pipe_producer_state; + } + bool issue_tma_load = cute::elect_one_sync(); if (issue_tma_load) { load_pipeline.producer_tail(load_pipe_producer_state); @@ -564,6 +641,7 @@ class CollectiveEpilogue< TensorStorage& shared_tensors, TensorMapD const& store_tensormap, int subtile_idx=-1) { + using namespace cute; using ElementAccumulator = typename AccEngine::value_type; using ElementCompute_ = typename epilogue::fusion::FusionCallbacksTraits::ElementCompute; @@ -587,6 +665,7 @@ class CollectiveEpilogue< // Represent the full output tensor, slice to get the tile this CTA is responsible for Tensor mD_mn = params.tma_store_d.get_tma_tensor(append<3>(make_shape(M,N), Int<1>{})); // (M,N,L) + Tensor mD = coalesce(mD_mn, take<0,2>(CtaTileMNK{})); Tensor gD = local_tile(mD, take<0,2>(CtaTileMNK{}), coord_shape); // (CTA_M,CTA_N) @@ -603,8 +682,27 @@ class CollectiveEpilogue< TiledCopy tiled_copy_C_atom = make_tiled_copy_C_atom(CopyAtomC{}, tiled_mma); + // (t)hread-partition for (r)egister to (r)egister copy (tRR_) + TiledCopy tiled_r2r = [&]() { + if constexpr (IsUseR2R) { + return make_tiled_copy_S(Copy_Atom{}, tiled_copy_C_atom); + } + else { + return make_tiled_copy_S(Copy_Atom, + ElementCompute>{}, tiled_copy_C_atom); + } + }(); + ThrCopy thread_r2r = tiled_r2r.get_slice(thread_idx); + // (t)hread-partition for (r)egister to (s)mem copy (tRS_) - TiledCopy tiled_r2s = make_tiled_copy_S(Copy_Atom{}, tiled_copy_C_atom); + TiledCopy tiled_r2s = [&]() { + if constexpr (IsUseR2R) { + return make_tiled_copy_D(Copy_Atom{}, tiled_r2r); + } + else { + return make_tiled_copy_S(Copy_Atom{}, tiled_copy_C_atom); + } + }(); ThrCopy thread_r2s = tiled_r2s.get_slice(thread_idx); Tensor tRS_rAcc = thread_r2s.retile_S(accumulators); // ((R2S,R2S_V),MMA_M,MMA_N) Tensor tRS_sD = thread_r2s.partition_D(sD_epi); // (R2S,R2S_M,R2S_N,PIPE_D) @@ -659,6 +757,8 @@ class CollectiveEpilogue< CUTE_STATIC_ASSERT(epi_tile_m % mma_tile_m == 0, "MMA_TILE_M must divide EPI_TILE_M"); CUTE_STATIC_ASSERT(mma_tile_n % epi_tile_n == 0, "EPI_TILE_N must divide MMA_TILE_N"); + // Get TiledCopy for partition reference when consumer store. 
+    TiledCopy tiled_copy_partition_ref = make_tiled_copy_S(Copy_Atom{}, tiled_copy_C_atom);
     // Get the fusion callbacks for the consumer store warps
     constexpr bool RefSrc = true; // Register tensors reference R2S copy src layout
     auto cst_args = cutlass::epilogue::fusion::detail::ConsumerStoreArgs{
@@ -667,7 +767,7 @@
       tile_coord_mnkl,
       tiled_mma,
       EpilogueTile{},
-      tiled_r2s,
+      tiled_copy_partition_ref,
       cD,
       residue_cD,
       tRS_cD,
@@ -700,7 +800,7 @@
     // Sync requirements of smem reuse may preclude this optimization
     // Delayed stores cause delayed stage releases which causes deadlock when StagesC == StagesD
     int epi_m_prev = 0, epi_n_prev = 0;
-    static_assert(not (DelayTmaStore and ReuseSmemC and StagesC == StagesD), "This TMA epilogue configuration will deadlock");
+    static_assert(not (DelayTmaStore and ReuseSmemC and StagesC <= StagesD), "This TMA epilogue configuration will deadlock");

     // The TMA store sequence for one subtile iteration
     auto tma_store_fn = [&] (int epi_m, int epi_n) {
@@ -812,6 +912,16 @@
       cst_callbacks.reduce(sD_epi(_,_,store_pipe_producer_state.index()), synchronize, epi_m, epi_n, is_last_iteration, tRS_rD_frg);

+      // Copy tile from register to register if needed
+      if constexpr (IsUseR2R) {
+        // Retile source and destination for tiled_r2r
+        Tensor tRR_rD_src = thread_r2r.retile_S(tRS_rD);                               // (R2R,R2R_M,R2R_N,EPI_M,EPI_N)
+        Tensor tRR_rD_dst = thread_r2r.retile_D(tRS_rD);                               // (R2R,R2R_M,R2R_N,EPI_M,EPI_N)
+
+        // Output needs register shuffling before copying to shared memory.
+        copy(tiled_r2r, tRR_rD_src, tRR_rD_dst);
+      }
+
       // Copy tile from register to smem
       if constexpr (is_destination_supported) {
         copy(tiled_r2s, tRS_rD, tRS_sD(_,_,_,store_pipe_producer_state.index()));
@@ -831,6 +941,7 @@
       } // for epi_m
     } // for epi_n

+
     if constexpr (DelayTmaStore) {
       // Issue TMA stores for the last subtile
       tma_store_fn(epi_m_prev, epi_n_prev);
@@ -869,11 +980,22 @@
   }

   CUTLASS_DEVICE auto
-  store_init(Params const& params, int32_t const sm_count, int32_t const sm_idx) const {
-    // Initialize tma
-    constexpr bool IsLoad = false;
-    auto store_tensormaps = tensormaps_init<IsLoad>(params, sm_count, sm_idx);
-    return store_tensormaps;
+  store_init(
+      Params const& params,
+      TensorMapStorage& shared_tensormaps,
+      int32_t sm_count,
+      int32_t sm_idx,
+      int32_t warp_group_idx) {
+    int warp_idx_in_warp_group = canonical_warp_idx_sync() % NumWarpsPerWarpGroup;
+    // Since only one warp issues the TMA store, only that warp needs to initialize tensormaps
+    if (warp_idx_in_warp_group == 0) {
+      // Initialize tma
+      constexpr bool IsLoad = false;
+      auto store_tensormaps = tensormaps_init<IsLoad>(params, shared_tensormaps, sm_count, sm_idx, warp_group_idx);
+      return store_tensormaps;
+    }
+    TmaDescriptor* null_tma_desc = nullptr;
+    return cute::make_tuple(null_tma_desc);
   }

   //
@@ -882,89 +1004,141 @@
   template <bool IsLoad>
   CUTLASS_DEVICE auto
-  tensormaps_init(Params const& params, int32_t const sm_count, int32_t const sm_idx) const {
-    cute::TmaDescriptor* tma_desc = nullptr;
-    cute::TmaDescriptor* gmem_tensormap = params.tensormaps;
+  tensormaps_init(
+      Params const& params,
+      TensorMapStorage& shared_tensormaps,
+      int32_t sm_count,
+      int32_t sm_idx,
+      int32_t warp_group_idx) {
+
+    constexpr uint32_t NumInputTensors = NumEpilogueWarpGroups + (cute::is_void_v<ElementC> ?
0 : 1); + Layout desc_layout = make_layout(make_shape(sm_count, Int{})); + + Tensor gmem_tensormap = make_tensor(params.tensormaps, desc_layout); // (SMs, NumInputTensors) + if constexpr (IsLoad) { if (not cute::is_void_v) { - tma_desc = &gmem_tensormap[sm_idx]; + constexpr int C_tensormap_index = NumEpilogueWarpGroups; + Tensor pC_tensormap = make_tensor(params.tma_load_c.get_tma_descriptor(), Int<1>{}, Int<1>{}); + Tensor sC_tensormap = make_tensor(make_smem_ptr(&shared_tensormaps.smem_tensormap_C), Int<1>{}, Int<1>{}); + if (cute::elect_one_sync()) { - // Bringing tensormaps from params to gmem for modification later - Tensor pC_tensormap = make_tensor(params.tma_load_c.get_tma_descriptor(), Int<1>{}, Int<1>{}); - Tensor gC_tensormap = make_tensor(tma_desc, Int<1>{}, Int<1>{}); - copy(recast(pC_tensormap), recast(gC_tensormap)); + // Bringing tensormaps from params to smem for modification later + copy(recast(pC_tensormap), recast(sC_tensormap)); } + syncwarp(); + return cute::make_tuple(&gmem_tensormap(sm_idx, C_tensormap_index)); + } - } else { - int const offset_Ddesc = cute::is_void_v ? 0 : sm_count; - tma_desc = &gmem_tensormap[sm_idx + offset_Ddesc]; + TmaDescriptor* null_tma_desc = nullptr; + return cute::make_tuple(null_tma_desc); + } + else { + Tensor pD_tensormap = make_tensor(params.tma_store_d.get_tma_descriptor(), Int<1>{}, Int<1>{}); + Tensor sD_tensormap = make_tensor(make_smem_ptr(&shared_tensormaps.smem_tensormap_D[warp_group_idx]), Int<1>{}, Int<1>{}); + if (cute::elect_one_sync()) { - // Bringing tensormaps from params to gmem for modification later - Tensor pD_tensormap = make_tensor(params.tma_store_d.get_tma_descriptor(), Int<1>{}, Int<1>{}); - Tensor gD_tensormap = make_tensor(tma_desc, Int<1>{}, Int<1>{}); - copy(recast(pD_tensormap), recast(gD_tensormap)); + // Bringing tensormaps from params to smem for modification later + copy(recast(pD_tensormap), recast(sD_tensormap)); } + syncwarp(); + return cute::make_tuple(&gmem_tensormap(sm_idx, warp_group_idx)); } - - return cute::make_tuple(tma_desc); } - // Bringing tensormaps to smem (to be done by single thread) + // Replace address for the global tensor (to be done by single thread) template CUTLASS_DEVICE void - tensormaps_fetch_to_smem( - TensorMapStorage& shared_tensormap, - cute::TmaDescriptor const* tensormap) const { + tensormaps_replace_global_address( + TensorMapStorage& shared_tensormaps, + Params const& params, + int32_t next_batch, + int32_t warp_group_idx) { + // Replacing global_address for the next batch if constexpr (IsLoad) { - if (not cute::is_void_v) { - Tensor gC_tensormap = make_tensor(make_gmem_ptr(tensormap), Int<1>{}, Int<1>{}); - Tensor sC_tensormap = make_tensor(make_smem_ptr(&shared_tensormap.smem_tensormap_C), Int<1>{}, Int<1>{}); - copy(recast(gC_tensormap), recast(sC_tensormap)); + if constexpr (is_source_supported) { + cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormaps.smem_tensormap_C, + params.ptr_C[next_batch]); } - } else { - Tensor gD_tensormap = make_tensor(make_gmem_ptr(tensormap), Int<1>{}, Int<1>{}); - Tensor sD_tensormap = make_tensor(make_smem_ptr(&shared_tensormap.smem_tensormap_D), Int<1>{}, Int<1>{}); - copy(recast(gD_tensormap), recast(sD_tensormap)); } - cp_async_fence(); - cp_async_wait<0>(); + else if constexpr (is_destination_supported) { + cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormaps.smem_tensormap_D[warp_group_idx], + params.ptr_D[next_batch]); + } } - // Replace address for the global tensor (to be done by single thread) - 
template <bool IsLoad>
+  // Replace dim and strides for the global tensor - used only for Grouped GEMM (to be done by single thread)
+  template <bool IsLoad, class ProblemShape_MNKL>
   CUTLASS_DEVICE
   void
-  tensormaps_replace_global_address(
-      TensorMapStorage& shared_tensormap,
+  tensormaps_replace_global_tensor_properties(
+      TensorMapStorage& shared_tensormaps,
       Params const& params,
-      int32_t next_batch) {
-    // Replacing global_address for the next batch
+      int32_t next_group,
+      ProblemShape_MNKL problem_shape_mnkl,
+      int32_t warp_group_idx) {
+    const uint32_t M = get<0>(problem_shape_mnkl);
+    const uint32_t N = get<1>(problem_shape_mnkl);
+    // Replace all dims for consistency
+    constexpr int MaxTensorRank = 5;
+    cute::array<uint64_t, MaxTensorRank> prob_shape  = {1,1,1,1,1};
+    cute::array<uint64_t, MaxTensorRank> prob_stride = {0,0,0,0,0};
+
     if constexpr (IsLoad) {
-      if (not cute::is_void_v<ElementC>) {
-        cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormap.smem_tensormap_C,
-                                                        params.ptr_C[next_batch]);
+      if constexpr (is_source_supported) {
+        ElementC const* ptr_C = nullptr;
+        Tensor tensor_c = make_tensor(ptr_C, make_layout(make_shape(M,N,Int<1>{}), params.dC[next_group]));
+
+        cute::detail::fill_tma_gmem_shape_stride(params.tma_load_c, tensor_c,
+                                                 prob_shape, prob_stride);
+        // Convert strides to byte strides
+        for (uint64_t& stride : prob_stride) {
+          stride = (stride * sizeof_bits_v<ElementC>) / 8;
+        }
+        cute::tma_descriptor_replace_dims_strides_in_shared_mem(shared_tensormaps.smem_tensormap_C,
+                                                                prob_shape,
+                                                                prob_stride);
       }
-    } else {
-      cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormap.smem_tensormap_D,
-                                                      params.ptr_D[next_batch]);
+    }
+    else if constexpr (is_destination_supported) {
+      ElementD const* ptr_D = nullptr;
+      Tensor tensor_d = make_tensor(ptr_D, make_layout(make_shape(M,N,Int<1>{}), params.dD[next_group]));
+
+      cute::detail::fill_tma_gmem_shape_stride(params.tma_store_d, tensor_d,
+                                               prob_shape, prob_stride);
+      // Convert strides to byte strides
+      for (uint64_t& stride : prob_stride) {
+        stride = (stride * sizeof_bits_v<ElementD>) / 8;
+      }
+
+      cute::tma_descriptor_replace_dims_strides_in_shared_mem(shared_tensormaps.smem_tensormap_D[warp_group_idx],
+                                                              prob_shape,
+                                                              prob_stride);
     }
   }

-  template <bool IsLoad>
+  template <bool IsLoad, class ProblemShape_MNKL>
   CUTLASS_DEVICE
   void
   tensormaps_perform_update(
-      TensorMapStorage& shared_tensormap,
+      TensorMapStorage& shared_tensormaps,
       Params const& params,
       cute::TmaDescriptor const* tensormap,
-      int32_t next_batch) {
-    if (cute::elect_one_sync()) {
-      // Bringing tensormaps to smem
-      tensormaps_fetch_to_smem(shared_tensormap, tensormap);
+      ProblemShape_MNKL problem_shape_mnkl,
+      int32_t next_batch,
+      int32_t warp_group_idx) {
+    if (cute::elect_one_sync()) {
       // Replacing global_address for the next batch
-      tensormaps_replace_global_address(shared_tensormap, params, next_batch);
+      tensormaps_replace_global_address<IsLoad>(shared_tensormaps, params, next_batch, warp_group_idx);
+
+      if constexpr (IsGroupedGemmKernel) {
+        // Replacing global dims and strides for the next batch
+        tensormaps_replace_global_tensor_properties<IsLoad>(
+            shared_tensormaps, params, next_batch, problem_shape_mnkl, warp_group_idx);
+      }
     }
   }

@@ -972,16 +1146,18 @@
   CUTLASS_DEVICE
   void
   tensormaps_cp_fence_release(
-      TensorMapStorage& shared_tensormap,
+      TensorMapStorage& shared_tensormaps,
       cute::TmaDescriptor const* tensormap,
-      [[maybe_unused]] uint32_t lane_predicate) {
+      const int32_t warp_group_idx = 0) {
+
+    // The entire warp must perform this copy (i.e., it is a warp-aligned operation)
     if constexpr (IsLoad) {
-      if (not cute::is_void_v<ElementC>) {
-        tma_descriptor_cp_fence_release(tensormap, shared_tensormap.smem_tensormap_C);
+      if constexpr
(is_source_supported) { + tma_descriptor_cp_fence_release(tensormap, shared_tensormaps.smem_tensormap_C); } - } else { - tma_descriptor_cp_fence_release(tensormap, shared_tensormap.smem_tensormap_D); + } + else if constexpr (is_destination_supported) { + tma_descriptor_cp_fence_release(tensormap, shared_tensormaps.smem_tensormap_D[warp_group_idx]); } } @@ -990,10 +1166,11 @@ class CollectiveEpilogue< void tensormaps_fence_acquire(cute::TmaDescriptor const* tensormap) { if constexpr (IsLoad) { - if (not cute::is_void_v) { + if constexpr (not cute::is_void_v) { cute::tma_descriptor_fence_acquire(tensormap); } - } else { + } + else { cute::tma_descriptor_fence_acquire(tensormap); } } diff --git a/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp b/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp index 56b55292a8..1ecf854085 100644 --- a/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized.hpp @@ -75,7 +75,8 @@ template < class CopyOpS2G_, class SmemLayoutAtomD_, class CopyOpR2S_, - class CopyAtomC_ + class CopyAtomC_, + class CopyOpR2R_ > class CollectiveEpilogue< Sm90TmaWarpSpecialized, @@ -92,7 +93,8 @@ class CollectiveEpilogue< CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_, - CopyAtomC_ + CopyAtomC_, + CopyOpR2R_ > { public: // @@ -113,6 +115,7 @@ class CollectiveEpilogue< using SmemLayoutAtomD = SmemLayoutAtomD_; using CopyOpR2S = CopyOpR2S_; using CopyAtomC = CopyAtomC_; + using CopyOpR2R = CopyOpR2R_; using ThreadEpilogueOp = typename epilogue::fusion::FusionCallbacksTraits::Operation; using GmemTiledCopyC = CopyOpG2S; @@ -147,6 +150,9 @@ class CollectiveEpilogue< constexpr static bool is_im2col_C = cute::is_same_v; constexpr static bool is_im2col_D = cute::is_same_v; + // Check if register transformation is needed before copying register to shared memory. 
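+  // (Same register-to-register hook as in the pointer-array epilogue above:
+  // a non-void CopyOpR2R enables an extra shuffle of the accumulator
+  // fragments before the R2S store.)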
+ constexpr static bool IsUseR2R = !cute::is_void_v; + using SmemLayoutC = decltype(tile_to_shape( SmemLayoutAtomC{}, make_shape(size<0>(EpilogueTile{}), size<1>(EpilogueTile{}), Int{}), @@ -321,13 +327,23 @@ class CollectiveEpilogue< if constexpr (is_destination_supported) { constexpr int tma_alignment_bits_D = cutlass::detail::get_output_alignment_bits(); constexpr int min_tma_aligned_elements_D = tma_alignment_bits_D / cutlass::sizeof_bits::value; - implementable = cutlass::detail::check_alignment(shape, StrideD{}); + if constexpr (cute::is_same_v) { // ignore L stride for implicit gemm + implementable = cutlass::detail::check_alignment(take<0,2>(shape), take<0,2>(StrideD{})); + } + else { + implementable = cutlass::detail::check_alignment(shape, StrideD{}); + } } if constexpr (not cute::is_void_v) { constexpr int tma_alignment_bits_C = cutlass::detail::get_input_alignment_bits(); constexpr int min_tma_aligned_elements_C = tma_alignment_bits_C / cutlass::sizeof_bits::value; - implementable = implementable && cutlass::detail::check_alignment(shape, StrideC{}); + if constexpr (cute::is_same_v) { // ignore L stride for implicit gemm + implementable = implementable && cutlass::detail::check_alignment(take<0,2>(shape), take<0,2>(StrideC{})); + } + else { + implementable = implementable && cutlass::detail::check_alignment(shape, StrideC{}); + } } if (!implementable) { @@ -454,12 +470,8 @@ class CollectiveEpilogue< // Predication for TMA load (one thread issues TMA load) bool issue_tma_load = cute::elect_one_sync(); - // Acquire the lock for the first stage - uint64_t* tma_barrier = load_pipeline.producer_get_barrier(load_pipe_producer_state); - load_pipeline.producer_acquire(load_pipe_producer_state); - // Pre-loop fusion callback entry point - pld_callbacks.begin(tma_barrier, load_pipe_producer_state.count(), issue_tma_load); + pld_callbacks.begin(); CUTLASS_PRAGMA_UNROLL for (int epi_n = 0; epi_n < size<3>(gC_epi); ++epi_n) { @@ -568,8 +580,27 @@ class CollectiveEpilogue< TiledCopy tiled_copy_C_atom = make_tiled_copy_C_atom(CopyAtomC{}, tiled_mma); + // (t)hread-partition for (r)egister to (r)egister copy (tRR_) + TiledCopy tiled_r2r = [&]() { + if constexpr (IsUseR2R) { + return make_tiled_copy_S(Copy_Atom{}, tiled_copy_C_atom); + } + else { + return make_tiled_copy_S(Copy_Atom, + ElementCompute>{}, tiled_copy_C_atom); + } + }(); + ThrCopy thread_r2r = tiled_r2r.get_slice(thread_idx); + // (t)hread-partition for (r)egister to (s)mem copy (tRS_) - TiledCopy tiled_r2s = make_tiled_copy_S(Copy_Atom{}, tiled_copy_C_atom); + TiledCopy tiled_r2s = [&]() { + if constexpr (IsUseR2R) { + return make_tiled_copy_D(Copy_Atom{}, tiled_r2r); + } + else { + return make_tiled_copy_S(Copy_Atom{}, tiled_copy_C_atom); + } + }(); ThrCopy thread_r2s = tiled_r2s.get_slice(thread_idx); Tensor tRS_rAcc = thread_r2s.retile_S(accumulators); // ((R2S,R2S_V),MMA_M,MMA_N) Tensor tRS_sD = thread_r2s.partition_D(sD_epi); // (R2S,R2S_M,R2S_N,PIPE_D) @@ -581,7 +612,7 @@ class CollectiveEpilogue< // Allocate D registers Layout tRS_rD_layout = make_layout(take<0,3>(shape(thread_r2s.partition_S(sD_epi)))); - Tensor tRS_rD = make_tensor(tRS_rD_layout); // (R2S,R2S_M,R2S_N) + Tensor tRS_rD = make_tensor(tRS_rD_layout); // (R2S,R2S_M,R2S_N) // Vectorized fragment view constexpr int FragmentSize = DispatchPolicy::FragmentSize; @@ -624,15 +655,17 @@ class CollectiveEpilogue< CUTE_STATIC_ASSERT(epi_tile_m % mma_tile_m == 0, "MMA_TILE_M must divide EPI_TILE_M"); CUTE_STATIC_ASSERT(mma_tile_n % epi_tile_n == 0, "EPI_TILE_N must 
divide MMA_TILE_N"); + // Get TiledCopy for partition reference when consumer store. + TiledCopy tiled_copy_partition_ref = make_tiled_copy_S(Copy_Atom{}, tiled_copy_C_atom); // Get the fusion callbacks for the consumer store warps - constexpr bool RefSrc = true; // Register tensors reference R2S copy src layout + constexpr bool RefSrc = true; // Register tensors reference tiled copy src layout auto cst_args = cutlass::epilogue::fusion::detail::ConsumerStoreArgs( problem_shape_mnkl, CtaTileMNK{}, tile_coord_mnkl, tiled_mma, EpilogueTile{}, - tiled_r2s, + tiled_copy_partition_ref, cD, residue_cD, tRS_cD, @@ -640,14 +673,14 @@ class CollectiveEpilogue< tRS_rC, thread_idx ); - auto cst_callbacks = fusion_callbacks.get_consumer_store_callbacks(cst_args); + auto cst_callbacks = fusion_callbacks.template get_consumer_store_callbacks(cst_args); bool is_producer_load_needed = fusion_callbacks.is_producer_load_needed(); bool is_C_load_needed = is_source_supported && fusion_callbacks.is_C_load_needed(); using FragmentVisit = decltype(cst_callbacks.visit(tRS_rAcc_frg(0), 0, 0, 0)); constexpr bool IsDirectR2S = cute::is_same_v>; using RegisterElementD = cute::conditional_t; - Tensor tRS_rCompute = make_tensor(tRS_rD_layout); // (R2S,R2S_M,R2S_N) + Tensor tRS_rCompute = make_tensor(tRS_rD_layout); // (R2S,R2S_M,R2S_N) Tensor tRS_rCompute_frg = recast>(tRS_rCompute); // Thread synchronizer for previously issued waits or fences @@ -670,8 +703,9 @@ class CollectiveEpilogue< // We can delay issue of TMA store by one iteration to achieve better interleaving of non-TMA instructions // Sync requirements of smem reuse may preclude this optimization // Delayed stores cause delayed stage releases which causes deadlock when StagesC == StagesD - int epi_m_prev = 0, epi_n_prev = 0; - static_assert(not (DelayTmaStore and ReuseSmemC and StagesC == StagesD), "This TMA epilogue configuration will deadlock"); + [[maybe_unused]] int epi_m_prev = 0; + [[maybe_unused]] int epi_n_prev = 0; + static_assert(not (DelayTmaStore and ReuseSmemC and StagesC <= StagesD), "This TMA epilogue configuration will deadlock"); // The TMA store sequence for one subtile iteration auto tma_store_fn = [&] (int epi_m, int epi_n) { @@ -725,7 +759,7 @@ class CollectiveEpilogue< for (int epi_n = 0; epi_n < size<3>(gD_epi); ++epi_n) { CUTLASS_PRAGMA_UNROLL for (int epi_m = 0; epi_m < size<2>(gD_epi); ++epi_m) { - bool is_first_iteration = epi_m == 0 && epi_n == 0; + [[maybe_unused]] bool is_first_iteration = epi_m == 0 && epi_n == 0; bool is_last_iteration = epi_m == size<2>(gD_epi)-1 && epi_n == size<3>(gD_epi)-1; if (subtile_idx != -1 && (epi_n * static_cast(size<2>(gD_epi)) + epi_m) != subtile_idx) { @@ -783,6 +817,16 @@ class CollectiveEpilogue< cst_callbacks.reduce(sD_epi(_,_,store_pipe_producer_state.index()), synchronize, epi_m, epi_n, is_last_iteration, tRS_rCompute_frg); + // Copy tile from register to regiser if needed + if constexpr (IsUseR2R) { + // retile source and destination for tiled_r2r + Tensor tRR_rD_src = thread_r2r.retile_S(tRS_rCompute); // (R2R,R2R_M,R2R_N,EPI_M,EPI_N) + Tensor tRR_rD_dst = thread_r2r.retile_D(tRS_rCompute); // (R2R,R2R_M,R2R_N,EPI_M,EPI_N) + + // Output register transformation before copying to shared memory. 
+ copy(tiled_r2r, tRR_rD_src, tRR_rD_dst); + } + CUTLASS_PRAGMA_UNROLL for (int i = 0; i < size(tRS_rD_frg); ++i) { tRS_rD_frg(i) = cutlass::NumericArrayConverter{}(tRS_rCompute_frg(i)); diff --git a/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized_bias_elementwise.hpp b/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized_bias_elementwise.hpp index b67c229c27..9749040081 100644 --- a/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized_bias_elementwise.hpp +++ b/include/cutlass/epilogue/collective/sm90_epilogue_tma_warpspecialized_bias_elementwise.hpp @@ -62,7 +62,8 @@ template < class CopyOpS2G_, class SmemLayoutAtomD_, class CopyOpR2S_, - class CopyAtomC_ + class CopyAtomC_, + class CopyOpR2R_ > class Sm90EpilogueTmaWarpSpecializedBiasElementwise : public CollectiveEpilogue< @@ -80,7 +81,8 @@ class Sm90EpilogueTmaWarpSpecializedBiasElementwise CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_, - CopyAtomC_ + CopyAtomC_, + CopyOpR2R_ > { private: using Impl = @@ -99,7 +101,8 @@ class Sm90EpilogueTmaWarpSpecializedBiasElementwise CopyOpS2G_, SmemLayoutAtomD_, CopyOpR2S_, - CopyAtomC_ + CopyAtomC_, + CopyOpR2R_ >; public: using DispatchPolicy = Sm90TmaWarpSpecializedBiasElementwise; diff --git a/include/cutlass/epilogue/dispatch_policy.hpp b/include/cutlass/epilogue/dispatch_policy.hpp index 40d085ccde..e1603678c0 100644 --- a/include/cutlass/epilogue/dispatch_policy.hpp +++ b/include/cutlass/epilogue/dispatch_policy.hpp @@ -46,12 +46,27 @@ namespace cutlass::epilogue { ////////////////////////////////////////////////////////////////////////////// struct PtrArrayDefault {}; +struct EpilogueSimtVectorized {}; +struct EpiloguePtrArraySimtVectorized {}; struct NoSmemWarpSpecialized {}; struct PtrArrayNoSmemWarpSpecialized {}; struct PtrArrayPlanarComplexNoSmemWarpSpecialized {}; struct TmaWarpSpecialized {}; struct TmaWarpSpecializedCooperative {}; -struct PtrArrayTmaWarpSpecializedCooperative {}; +struct PtrArrayTmaWarpSpecializedCooperative { + static constexpr int NumEpilogueWarpGroups = 2; +}; + +// Standard warp specialized epilogue +struct PtrArrayTmaWarpSpecialized { + static constexpr int NumEpilogueWarpGroups = 1; +}; + +// Pingpong kernel epilogue +struct PtrArrayTmaWarpSpecializedPingpong { + static constexpr int NumEpilogueWarpGroups = 2; +}; + // DEPRECATED schedules, will be removed in next release struct TmaWarpSpecializedElementwiseBase : public TmaWarpSpecialized {}; struct TmaWarpSpecializedCooperativeElementwiseBase : public TmaWarpSpecializedCooperative {}; @@ -151,7 +166,8 @@ template< int StagesD_, int FragmentSize_, bool ReuseSmemC_, - bool DelayTmaStore_ + bool DelayTmaStore_, + int NumEpilogueWarpGroups_ > struct Sm90PtrArrayTmaWarpSpecialized { constexpr static int StagesC = StagesC_; @@ -159,6 +175,7 @@ struct Sm90PtrArrayTmaWarpSpecialized { constexpr static int FragmentSize = FragmentSize_; constexpr static bool ReuseSmemC = ReuseSmemC_; constexpr static bool DelayTmaStore = DelayTmaStore_; + constexpr static int NumEpilogueWarpGroups = NumEpilogueWarpGroups_; }; // DEPRECATED policies, will be removed in next release diff --git a/include/cutlass/epilogue/fusion/operations.hpp b/include/cutlass/epilogue/fusion/operations.hpp index a483b1ba94..3aed32710f 100644 --- a/include/cutlass/epilogue/fusion/operations.hpp +++ b/include/cutlass/epilogue/fusion/operations.hpp @@ -32,6 +32,8 @@ #pragma once #include +#include +#include ///////////////////////////////////////////////////////////////////////////////////////////////// 
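// Illustrative usage (not part of this patch): a kernel picks one of the
// operation tags defined below at compile time, e.g.
//
//   using FusionOperation = cutlass::epilogue::fusion::LinCombEltAct<
//       cutlass::epilogue::thread::ReLu, ElementD, ElementCompute>;
//
// where ElementD and ElementCompute are placeholder element types, and the
// matching FusionCallbacks specialization (see
// sm90_callbacks_tma_warpspecialized.hpp further below) expands the tag into
// an EVT expression tree.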
@@ -122,6 +124,19 @@ struct LinCombEltAct static constexpr bool IsEltActSupported = true; }; +// D = softmax(top_k(alpha * acc + beta * C)) +template< + int TopK, + class ElementOutput_, + class ElementCompute_, + class ElementSource_ = ElementOutput_, + class ElementScalar_ = ElementCompute_, + FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest +> +struct LinCombTopKSoftmaxCol + : LinearCombination { +}; + // D = alpha * acc + beta * C + per-row bias template< @@ -130,7 +145,7 @@ template< class ElementBias_ = ElementOutput_, class ElementSource_ = ElementOutput_, class ElementScalar_ = ElementCompute_, - int AlignmentBias_ = 128 / sizeof_bits_v, + int AlignmentBias_ = 128 / cute::sizeof_bits_v, FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest > struct LinCombPerRowBias @@ -140,6 +155,23 @@ struct LinCombPerRowBias static constexpr bool IsPerRowBiasSupported = true; }; +// D = alpha * acc + beta * C + per-column bias +template< + class ElementOutput_, + class ElementCompute_, + class ElementBias_ = ElementOutput_, + class ElementSource_ = ElementOutput_, + class ElementScalar_ = ElementCompute_, + int AlignmentBias_ = 128 / cute::sizeof_bits_v, + FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest +> +struct LinCombPerColBias + : LinearCombination { + using ElementBias = ElementBias_; + static constexpr int AlignmentBias = AlignmentBias_; + static constexpr bool IsPerColBiasSupported = true; +}; + // D = activation(alpha * acc + beta * C + per-row bias) template< template class ActivationFn_, @@ -148,7 +180,7 @@ template< class ElementBias_ = ElementOutput_, class ElementSource_ = ElementOutput_, class ElementScalar_ = ElementCompute_, - int AlignmentBias_ = 128 / sizeof_bits_v, + int AlignmentBias_ = 128 / cute::sizeof_bits_v, FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest > struct LinCombPerRowBiasEltAct @@ -169,8 +201,8 @@ template< class ElementBias_ = ElementOutput_, class ElementSource_ = ElementOutput_, class ElementScalar_ = ElementCompute_, - int AlignmentAux_ = 128 / sizeof_bits_v, - int AlignmentBias_ = 128 / sizeof_bits_v, + int AlignmentAux_ = 128 / cute::sizeof_bits_v, + int AlignmentBias_ = 128 / cute::sizeof_bits_v, FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest > struct LinCombPerRowBiasEltActAux @@ -190,8 +222,8 @@ template< class ElementBias_ = ElementOutput_, class ElementSource_ = ElementOutput_, class ElementScalar_ = ElementCompute_, // per-row alpha/beta - int AlignmentBias_ = 128 / sizeof_bits_v, - int AlignmentScalar_ = 128 / sizeof_bits_v, + int AlignmentBias_ = 128 / cute::sizeof_bits_v, + int AlignmentScalar_ = 128 / cute::sizeof_bits_v, FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest > struct PerRowLinCombPerRowBiasEltAct @@ -213,7 +245,7 @@ template< class ElementBias_ = ElementOutput_, class ElementSource_ = ElementOutput_, class ElementScalar_ = ElementCompute_, - int AlignmentBias_ = 128 / sizeof_bits_v, + int AlignmentBias_ = 128 / cute::sizeof_bits_v, FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest > struct ScaledLinCombPerRowBiasEltAct @@ -243,8 +275,8 @@ template< class ElementBias_ = ElementOutput_, class ElementSource_ = ElementOutput_, class ElementScalar_ = ElementCompute_, - int AlignmentAux_ = 128 / sizeof_bits_v, - int AlignmentBias_ = 128 / sizeof_bits_v, + int AlignmentAux_ = 128 / cute::sizeof_bits_v, + int AlignmentBias_ = 128 / cute::sizeof_bits_v, FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest > struct 
ScaledLinCombPerRowBiasEltActAmaxAux @@ -270,7 +302,7 @@ template< class ElementAux_ = ElementOutput_, class ElementSource_ = ElementOutput_, class ElementScalar_ = ElementCompute_, - int AlignmentAux_ = 128 / sizeof_bits_v, + int AlignmentAux_ = 128 / cute::sizeof_bits_v, FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest > struct LinCombDeEltAct @@ -297,8 +329,8 @@ template< class ElementBias_ = ElementCompute_, class ElementSource_ = ElementOutput_, class ElementScalar_ = ElementCompute_, - int AlignmentAux_ = 128 / sizeof_bits_v, - int AlignmentBias_ = 128 / sizeof_bits_v, + int AlignmentAux_ = 128 / cute::sizeof_bits_v, + int AlignmentBias_ = 128 / cute::sizeof_bits_v, FloatRoundStyle RoundStyle_ = FloatRoundStyle::round_to_nearest > struct LinCombDeEltActDePerRowBias diff --git a/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp index 03ffbe454a..e028846a4f 100644 --- a/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp @@ -46,6 +46,8 @@ #include "cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp" #include "cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp" +#include "cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp" + ///////////////////////////////////////////////////////////////////////////////////////////////// namespace cutlass::epilogue::fusion { @@ -75,12 +77,12 @@ struct FusionCallbacks< CtaTileShapeMNK, EpilogueTile > : Sm90EVT, - Sm90ScalarBroadcast, + Sm90ScalarBroadcast>, Sm90AccFetch > { using Impl = Sm90EVT, - Sm90ScalarBroadcast, + Sm90ScalarBroadcast>, Sm90AccFetch >; using Operation = fusion::ScaledAcc; @@ -92,12 +94,15 @@ struct FusionCallbacks< ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; + using StrideAlpha = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + // Conversion to the args expected by the visitor implementation // to_underlying_arguments will implicitly call this operator typename Impl::Arguments() const { return { // binary op : alpha * acc - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {} // binary args : multiplies }; // end binary op @@ -120,10 +125,10 @@ template< > using Sm90LinearCombination = Sm90EVT, // beta * C + (alpha * acc) - Sm90ScalarBroadcast, // beta + Sm90ScalarBroadcast>, // beta Sm90SrcFetch, // C Sm90EVT, // alpha * acc - Sm90ScalarBroadcast, // alpha + Sm90ScalarBroadcast>, // alpha Sm90AccFetch // acc > >; @@ -158,13 +163,101 @@ struct FusionCallbacks< ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + operator typename Impl::Arguments() const { return { // ternary op : beta * C + (alpha * acc) - {{beta}, {beta_ptr}}, // leaf args : beta + {{beta}, {beta_ptr}, {dBeta}}, // leaf args : beta {}, // leaf args : C { // binary op : alpha * acc - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha + {}, // leaf args : acc + {} // binary args : multiplies + }, // end binary op + {} // ternary args : multiply_add + }; // end ternary op + } + }; + + // Ctor inheritance + using Impl::Impl; +}; + 
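+// Illustrative host-side setup for the pointer-array linear combination above
+// (d_alpha_ptrs and d_beta_ptrs are hypothetical device arrays holding one
+// ElementScalar pointer per group; plain alpha/beta scalars remain available
+// when every group shares the same values):
+//
+//   typename FusionCallbacks::Arguments fusion_args{};
+//   fusion_args.alpha_ptr_array = d_alpha_ptrs;  // per-group alpha
+//   fusion_args.beta_ptr_array  = d_beta_ptrs;   // per-group beta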
+///////////////////////////////////////////////////////////////////////////////////////////////// + +// D = alpha * acc + beta * C, where beta and alpha can be vectors for each batch +template< + class ElementOutput, + class ElementCompute, + class ElementSource = ElementOutput, + class ElementScalar = ElementCompute, + FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest +> +using Sm90LinearCombinationPtrArray = + Sm90EVT, // beta * C + (alpha * acc) + Sm90ScalarBroadcastPtrArray>, // beta + Sm90SrcFetch, // C + Sm90EVT, // alpha * acc + Sm90ScalarBroadcastPtrArray>, // alpha + Sm90AccFetch // acc + > + >; + +template < + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + int NumEpilogueWarpGroups, + class ElementOutput, + class ElementCompute, + class ElementSource, + class ElementScalar, + FloatRoundStyle RoundStyle, + class CtaTileShapeMNK, + class EpilogueTile +> +struct FusionCallbacks< + epilogue::Sm90PtrArrayTmaWarpSpecialized, + fusion::LinearCombination, + CtaTileShapeMNK, + EpilogueTile +> : Sm90LinearCombinationPtrArray::type, ElementCompute, ElementSource, ElementScalar, RoundStyle> { + + using Impl = Sm90LinearCombinationPtrArray::type, ElementCompute, ElementSource, ElementScalar, RoundStyle>; + using Operation = fusion::LinearCombination; + + struct Arguments { + ElementScalar alpha = ElementScalar(1); + ElementScalar beta = ElementScalar(0); + ElementScalar const* alpha_ptr = nullptr; + ElementScalar const* beta_ptr = nullptr; + ElementScalar const* const* alpha_ptr_array = nullptr; + ElementScalar const* const* beta_ptr_array = nullptr; + + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + operator typename Impl::Arguments() const { + return + { // ternary op : beta * C + (alpha * acc) + {{beta}, {beta_ptr}, {beta_ptr_array}, {dBeta}}, // leaf args : beta + {}, // leaf args : C + { // binary op : alpha * acc + {{alpha}, {alpha_ptr}, {alpha_ptr_array}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {} // binary args : multiplies }, // end binary op @@ -224,6 +317,11 @@ struct FusionCallbacks< ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + using ActivationArguments = typename Sm90Compute::Arguments; ActivationArguments activation = ActivationArguments(); @@ -231,10 +329,96 @@ struct FusionCallbacks< return { // unary op: activation(beta * C + (alpha * acc)) { // ternary op : beta * C + (alpha * acc) - {{beta}, {beta_ptr}}, // leaf args : beta + {{beta}, {beta_ptr}, {dBeta}}, // leaf args : beta {}, // leaf args : C { // binary op : alpha * acc - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha + {}, // leaf args : acc + {} // binary args : multiplies + }, // end binary op + {} // ternary args : multiply_add + }, // end ternary op + activation // unary args: activation + }; // end unary op + } + }; + + // Ctor inheritance + using Impl::Impl; +}; + +///////////////////////////////////////////////////////////////////////////////////////////////// + +// D = activation(alpha * acc + beta * C), where beta and alpha can be vectors for each batch +template< + template class ActivationFn, + class ElementOutput, + class ElementCompute, + class 
ElementSource = ElementOutput, + class ElementScalar = ElementCompute, + FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest +> +using Sm90LinCombEltActPtrArray = + Sm90EVT, // activation(beta * C + (alpha * acc)) + Sm90LinearCombinationPtrArray // beta * C + (alpha * acc) + >; + +template < + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + int NumEpilogueWarpGroups, + template class ActivationFn, + class ElementOutput, + class ElementCompute, + class ElementSource, + class ElementScalar, + FloatRoundStyle RoundStyle, + class CtaTileShapeMNK, + class EpilogueTile +> +struct FusionCallbacks< + epilogue::Sm90PtrArrayTmaWarpSpecialized, + fusion::LinCombEltAct, + CtaTileShapeMNK, + EpilogueTile +> : Sm90LinCombEltActPtrArray { + + using Impl = Sm90LinCombEltActPtrArray::type, ElementCompute, ElementSource, ElementScalar, RoundStyle>; + using Operation = fusion::LinCombEltAct; + + struct Arguments { + ElementScalar alpha = ElementScalar(1); + ElementScalar beta = ElementScalar(0); + ElementScalar const* alpha_ptr = nullptr; + ElementScalar const* beta_ptr = nullptr; + ElementScalar const* const* alpha_ptr_array = nullptr; + ElementScalar const* const* beta_ptr_array = nullptr; + + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + using ActivationArguments = typename Sm90Compute::Arguments; + ActivationArguments activation = ActivationArguments(); + + operator typename Impl::Arguments() const { + return + { // unary op: activation(beta * C + (alpha * acc)) + { // ternary op : beta * C + (alpha * acc) + {{beta}, {beta_ptr}, {beta_ptr_array}, {dBeta}}, // leaf args : beta + {}, // leaf args : C + { // binary op : alpha * acc + {{alpha}, {alpha_ptr}, {alpha_ptr_array}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {} // binary args : multiplies }, // end binary op @@ -264,12 +448,12 @@ template< > using Sm90LinCombPerRowBias = Sm90EVT, // beta * C + (alpha * acc + bias) - Sm90ScalarBroadcast, // beta + Sm90ScalarBroadcast>, // beta Sm90SrcFetch, // C Sm90EVT, // alpha * acc + bias - Sm90ScalarBroadcast, // alpha + Sm90ScalarBroadcast>, // alpha Sm90AccFetch, // acc - Sm90ColBroadcast<0, CtaTileShapeMNK, ElementBias, Stride<_1,_0,int>, AlignmentBias> // bias + Sm90ColBroadcast<0, CtaTileShapeMNK, ElementBias, ElementCompute, Stride<_1,_0,int64_t>, AlignmentBias> // bias > >; @@ -307,17 +491,111 @@ struct FusionCallbacks< ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; - using StrideBias = Stride<_1,_0,int>; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + using StrideBias = Stride<_1,_0,int64_t>; + ElementBias const* bias_ptr = nullptr; + StrideBias dBias = {}; + + operator typename Impl::Arguments() const { + return + { // ternary op : beta * C + (alpha * acc + bias) + {{beta}, {beta_ptr}, {dBeta}}, // leaf args : beta + {}, // leaf args : C + { // ternary op : alpha * acc + bias + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha + {}, // leaf args : acc + {bias_ptr, ElementBias(0), dBias}, // leaf args : bias + {} // ternary args : multiply_add + }, // end ternary op + {} // ternary args : multiply_add + }; // end ternary op + } + }; + + // Ctor inheritance + using Impl::Impl; +}; + 
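The two pointer-array fusions defined above (`Sm90LinearCombinationPtrArray` and the activation variant `Sm90LinCombEltActPtrArray`) extend this batching support to grouped GEMM, where every group supplies its own scalars through `alpha_ptr_array`/`beta_ptr_array`. The sketch below captures, in plain C++, the selection order implied by `Sm90ScalarBroadcastPtrArray::update_scalar` later in this diff, together with the elementwise math these epilogues compute; `resolve_group_scalar` and `lin_comb_elt_act` are illustrative names, and ReLU merely stands in for the `ActivationFn` template parameter.

```cpp
#include <algorithm>
#include <cstdint>

// Hedged, host-side sketch: a per-group pointer array wins over a single
// strided pointer, which wins over the plain host value, mirroring the
// order visible in Sm90ScalarBroadcastPtrArray's scalar update.
float resolve_group_scalar(float scalar,                         // host fallback
                           float const* scalar_ptr,              // strided vector
                           float const* const* scalar_ptr_array, // one ptr per group
                           int64_t stride_l, int g) {
  if (scalar_ptr_array != nullptr) return *scalar_ptr_array[g * stride_l];
  if (scalar_ptr != nullptr)       return scalar_ptr[g * stride_l];
  return scalar;  // batch stride is ignored for the nullptr fallback
}

// Elementwise reference for the fused epilogue these trees evaluate:
// D = activation(alpha * acc + beta * C), here with ReLU as the activation.
float lin_comb_elt_act(float alpha, float acc, float beta, float c) {
  float z = alpha * acc + beta * c;  // ternary EVT node: multiply_add
  return std::max(z, 0.0f);          // unary EVT node: activation
}
```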
+///////////////////////////////////////////////////////////////////////////////////////////////// + +// D = alpha * acc + beta * C + per-column bias +template< + int StagesC, + class CtaTileShapeMNK, + class EpilogueTile, + class ElementOutput, + class ElementCompute, + class ElementBias = ElementOutput, + class ElementSource = ElementOutput, + class ElementScalar = ElementCompute, + int AlignmentBias = 128 / sizeof_bits_v, + FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest +> +using Sm90LinCombPerColBias = + Sm90EVT, // beta * C + (alpha * acc + bias) + Sm90ScalarBroadcast>, // beta + Sm90SrcFetch, // C + Sm90EVT, // alpha * acc + bias + Sm90ScalarBroadcast>, // alpha + Sm90AccFetch, // acc + Sm90RowBroadcast<0, CtaTileShapeMNK, ElementBias, ElementCompute, Stride<_0,_1,int64_t>, AlignmentBias> // bias + > + >; + +template < + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + class ElementOutput, + class ElementCompute, + class ElementBias, + class ElementSource, + class ElementScalar, + int AlignmentBias, + FloatRoundStyle RoundStyle, + class CtaTileShapeMNK, + class EpilogueTile +> +struct FusionCallbacks< + epilogue::Sm90TmaWarpSpecialized, + fusion::LinCombPerColBias, + CtaTileShapeMNK, + EpilogueTile +> : Sm90LinCombPerColBias< + StagesC, CtaTileShapeMNK, EpilogueTile, ElementOutput, ElementCompute, ElementBias, ElementSource, ElementScalar, AlignmentBias, RoundStyle> { + using Impl = Sm90LinCombPerColBias< + StagesC, CtaTileShapeMNK, EpilogueTile, ElementOutput, ElementCompute, ElementBias, ElementSource, ElementScalar, AlignmentBias, RoundStyle>; + using Operation = fusion::LinCombPerColBias< + ElementOutput, ElementCompute, ElementBias, ElementSource, ElementScalar, AlignmentBias, RoundStyle>; + + struct Arguments { + ElementScalar alpha = ElementScalar(1); + ElementScalar beta = ElementScalar(0); + ElementScalar const* alpha_ptr = nullptr; + ElementScalar const* beta_ptr = nullptr; + + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + using StrideBias = Stride<_0,_1,int64_t>; ElementBias const* bias_ptr = nullptr; StrideBias dBias = {}; operator typename Impl::Arguments() const { return { // ternary op : beta * C + (alpha * acc + bias) - {{beta}, {beta_ptr}}, // leaf args : beta + {{beta}, {beta_ptr}, {dBeta}}, // leaf args : beta {}, // leaf args : C { // ternary op : alpha * acc + bias - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {bias_ptr, ElementBias(0), dBias}, // leaf args : bias {} // ternary args : multiply_add @@ -393,7 +671,12 @@ struct FusionCallbacks< ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; - using StrideBias = Stride<_1,_0,int>; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + using StrideBias = Stride<_1,_0,int64_t>; ElementBias const* bias_ptr = nullptr; StrideBias dBias = {}; @@ -404,10 +687,10 @@ struct FusionCallbacks< return { // unary op : activation(beta * C + (alpha * acc + bias)) { // ternary op : beta * C + (alpha * acc + bias) - {{beta}, {beta_ptr}}, // leaf args : beta + {{beta}, {beta_ptr}, {dBeta}}, // leaf args : beta {}, // leaf args : C { // ternary op : alpha * acc + bias - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, 
{alpha_ptr}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {bias_ptr, ElementBias(0), dBias}, // leaf args : bias {} // ternary args : multiply_add @@ -506,7 +789,12 @@ struct FusionCallbacks< ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; - using StrideBias = Stride<_1,_0,int>; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + using StrideBias = Stride<_1,_0,int64_t>; ElementBias const* bias_ptr = nullptr; StrideBias dBias = {}; @@ -522,10 +810,10 @@ struct FusionCallbacks< { // unary op : activation(store(beta * C + (alpha * acc + bias))) { // unary op : store(beta * C + (alpha * acc + bias)) { // ternary op : beta * C + (alpha * acc + bias) - {{beta}, {beta_ptr}}, // leaf args : beta + {{beta}, {beta_ptr}, {dBeta}}, // leaf args : beta {}, // leaf args : C { // ternary op : alpha * acc + bias - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {bias_ptr, ElementBias(0), dBias}, // leaf args : bias {} // ternary args : multiply_add @@ -558,12 +846,12 @@ template< > using Sm90PerRowLinCombPerRowBias = Sm90EVT, // beta * C + (alpha * acc + bias) - Sm90ColBroadcast<0, CtaTileShapeMNK, ElementScalar, Stride<_1,_0,int>, AlignmentScalar>, // beta + Sm90ColBroadcast<0, CtaTileShapeMNK, ElementScalar, ElementCompute, Stride, AlignmentScalar>, // beta, dynamic scalar/vector broadcast Sm90SrcFetch, // C Sm90EVT, // alpha * acc + bias - Sm90ColBroadcast<0, CtaTileShapeMNK, ElementScalar, Stride<_1,_0,int>, AlignmentScalar>, // alpha + Sm90ColBroadcast<0, CtaTileShapeMNK, ElementScalar, ElementCompute, Stride, AlignmentScalar>, // alpha, dynamic scalar/vector broadcast Sm90AccFetch, // acc - Sm90ColBroadcast<0, CtaTileShapeMNK, ElementBias, Stride<_1,_0,int>, AlignmentBias> // bias + Sm90ColBroadcast<0, CtaTileShapeMNK, ElementBias, ElementCompute, Stride<_1,_0,int64_t>, AlignmentBias> // bias > >; @@ -625,16 +913,16 @@ struct FusionCallbacks< >; struct Arguments { - using StrideAlpha = Stride<_1,_0,int>; - using StrideBeta = Stride<_1,_0,int>; + using StrideAlpha = Stride; + using StrideBeta = Stride; ElementScalar alpha = ElementScalar(1); ElementScalar beta = ElementScalar(0); ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; - StrideAlpha dAlpha = {}; - StrideBeta dBeta = {}; + StrideAlpha dAlpha = {bool(1), _0{}, 0}; + StrideBeta dBeta = {bool(1), _0{}, 0}; - using StrideBias = Stride<_1,_0,int>; + using StrideBias = Stride<_1,_0,int64_t>; ElementBias const* bias_ptr = nullptr; StrideBias dBias = {}; @@ -697,12 +985,12 @@ template< > using Sm90ScaledLinCombPerRowBias = Sm90EVT, // beta * C + (alpha * acc + bias) - Sm90ScalarBroadcast, 2>, // scale_c * beta + Sm90ScalarBroadcast, 2>, // scale_c * beta Sm90SrcFetch, // C Sm90EVT, // alpha * acc + bias - Sm90ScalarBroadcast, 3>, // scale_a * scale_b * alpha + Sm90ScalarBroadcast, 3>, // scale_a * scale_b * alpha Sm90AccFetch, // acc - Sm90ColBroadcast<0, CtaTileShapeMNK, ElementBias, Stride<_1,_0,int>, AlignmentBias> // bias + Sm90ColBroadcast<0, CtaTileShapeMNK, ElementBias, ElementCompute, Stride<_1,_0,int64_t>, AlignmentBias> // bias > >; @@ -783,7 +1071,12 @@ struct FusionCallbacks< ElementScalar const* scale_c_ptr = nullptr; ElementScalar const* scale_d_ptr = nullptr; - using StrideBias = Stride<_1,_0,int>; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = 
Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + using StrideBias = Stride<_1,_0,int64_t>; ElementBias const* bias_ptr = nullptr; StrideBias dBias = {}; @@ -795,13 +1088,15 @@ struct FusionCallbacks< { // binary op : activation((scale_c * beta) * C + ((scale_a * scale_b * alpha) * acc + bias)) * scale_d { // unary op : activation((scale_c * beta) * C + ((scale_a * scale_b * alpha) * acc + bias)) { // ternary op : (scale_c * beta) * C + ((scale_a * scale_b * alpha) * acc + bias) - {{scale_c, beta}, - {scale_c_ptr, beta_ptr} + {{beta, scale_c}, + {beta_ptr, scale_c_ptr}, + {dBeta, {_0{}, _0{}, 0}} }, // leaf args : (scale_c * beta) {}, // leaf args : C { // ternary op : (scale_a * scale_b * alpha) * acc + bias - {{scale_a, scale_b, alpha}, - {scale_a_ptr, scale_b_ptr, alpha_ptr} + {{alpha, scale_a, scale_b}, + {alpha_ptr, scale_a_ptr, scale_b_ptr}, + {dAlpha, {_0{}, _0{}, 0}, {_0{}, _0{}, 0}} }, // leaf args : (scale_a * scale_b * alpha) {}, // leaf args : acc {bias_ptr, ElementBias(0), dBias}, // leaf args : bias @@ -1017,7 +1312,12 @@ struct FusionCallbacks< ElementScalar scale_aux = ElementScalar(1); ElementScalar const* scale_aux_ptr = nullptr; - using StrideBias = Stride<_1,_0,int>; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + + using StrideBias = Stride<_1,_0,int64_t>; ElementBias const* bias_ptr = nullptr; StrideBias dBias = {}; @@ -1046,13 +1346,15 @@ struct FusionCallbacks< Z_args = { // ternary op : (scale_c * beta) * C + ((scale_a * scale_b * alpha) * acc + bias) - {{scale_c, beta}, - {scale_c_ptr, beta_ptr} + {{beta, scale_c}, + {beta_ptr, scale_c_ptr}, + {dBeta, {_0{}, _0{}, 0}} }, // leaf args : (scale_c * beta) {}, // leaf args : C { // ternary op : (scale_a * scale_b * alpha) * acc + bias - {{scale_a, scale_b, alpha}, - {scale_a_ptr, scale_b_ptr, alpha_ptr} + {{alpha, scale_a, scale_b}, + {alpha_ptr, scale_a_ptr, scale_b_ptr}, + {dAlpha ,{_0{}, _0{}, 0}, {_0{}, _0{}, 0}} }, // leaf args : (scale_a * scale_b * alpha) {}, // leaf args : acc {bias_ptr, ElementBias(0), dBias}, // leaf args : bias @@ -1102,13 +1404,15 @@ struct FusionCallbacks< { // unary op : activation(Z) { // unary op : store(Z) { // ternary op : (scale_c * beta) * C + ((scale_a * scale_b * alpha) * acc + bias) - {{scale_c, beta}, - {scale_c_ptr, beta_ptr} + {{beta, scale_c}, + {beta_ptr, scale_c_ptr}, + {dBeta, {_0{}, _0{}, 0}} }, // leaf args : (scale_c * beta) {}, // leaf args : C { // ternary op : (scale_a * scale_b * alpha) * acc + bias - {{scale_a, scale_b, alpha}, - {scale_a_ptr, scale_b_ptr, alpha_ptr} + {{alpha, scale_a, scale_b}, + {alpha_ptr, scale_a_ptr, scale_b_ptr}, + {dAlpha, {_0{}, _0{}, 0}} }, // leaf args : (scale_a * scale_b * alpha) {}, // leaf args : acc {bias_ptr, ElementBias(0), dBias @@ -1210,6 +1514,11 @@ struct FusionCallbacks< ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + using ActivationArguments = typename Sm90Compute::Arguments; ActivationArguments activation = ActivationArguments(); @@ -1221,10 +1530,10 @@ struct FusionCallbacks< return { // binary op : activation(beta * C + (alpha * acc), aux) { // ternary op : beta * C + (alpha * acc) - {{beta}, {beta_ptr}}, // leaf args : beta + {{beta}, {beta_ptr}, 
{dBeta}}, // leaf args : beta {}, // leaf args : C { // binary op : alpha * acc - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {} // binary args : multiplies }, // end binary op @@ -1263,7 +1572,7 @@ template< using Sm90LinCombDeEltActDePerRowBias = Sm90EVT, // Identity for final conversion Sm90EVT, AlignmentBias>, + ElementBias, ElementCompute, RoundStyle, Stride<_1,_0,int64_t>, AlignmentBias>, Sm90LinCombDeEltAct > @@ -1323,6 +1632,11 @@ struct FusionCallbacks< ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + using ActivationArguments = typename Sm90Compute::Arguments; ActivationArguments activation = ActivationArguments(); @@ -1330,7 +1644,7 @@ struct FusionCallbacks< ElementAux const* aux_ptr = nullptr; StrideAux dAux = {}; - using StrideBias = Stride<_1,_0,int>; + using StrideBias = Stride<_1,_0,int64_t>; ElementBias* dbias_ptr = nullptr; StrideBias dDbias = {}; @@ -1340,10 +1654,10 @@ struct FusionCallbacks< { // unary op : reduce(activation(beta * C + (alpha * acc), aux)) { // binary op : activation(beta * C + (alpha * acc), aux) { // ternary op : beta * C + (alpha * acc) - {{beta}, {beta_ptr}}, // leaf args : beta + {{beta}, {beta_ptr}, {dBeta}}, // leaf args : beta {}, // leaf args : C { // binary op : alpha * acc - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {} // binary args : multiplies }, // end binary op @@ -1365,6 +1679,78 @@ struct FusionCallbacks< ///////////////////////////////////////////////////////////////////////////////////////////////// +// D = softmax(top_k(alpha * acc + beta * C)) +template< + int TopK, + int FragmentSize, + class CtaTileShapeMNK, + class EpilogueTile, + class ElementOutput, + class ElementCompute, + class ElementSource = ElementOutput, + class ElementScalar = ElementCompute, + FloatRoundStyle RoundStyle = FloatRoundStyle::round_to_nearest +> +using Sm90LinCombTopKSoftmaxCol = + Sm90EVT, // softmax(top_k(beta * C + (alpha * acc))) + Sm90LinearCombination // beta * C + (alpha * acc) + >; + +template < + int TopK, + int StagesC, + int StagesD, + int FragmentSize, + bool ReuseSmemC, + bool DelayTmaStore, + class ElementOutput, + class ElementCompute, + class ElementSource, + class ElementScalar, + FloatRoundStyle RoundStyle, + class CtaTileShapeMNK, + class EpilogueTile +> +struct FusionCallbacks< + epilogue::Sm90TmaWarpSpecialized, + fusion::LinCombTopKSoftmaxCol, + CtaTileShapeMNK, + EpilogueTile +> : Sm90LinCombTopKSoftmaxCol { + + using Impl = Sm90LinCombTopKSoftmaxCol::type, ElementCompute, ElementSource, ElementScalar, RoundStyle>; + using Operation = fusion::LinCombTopKSoftmaxCol; + + struct Arguments { + ElementScalar alpha = ElementScalar(1); + ElementScalar beta = ElementScalar(0); + ElementScalar const* alpha_ptr = nullptr; + ElementScalar const* beta_ptr = nullptr; + + operator typename Impl::Arguments() const { + return + { // unary op: activation(beta * C + (alpha * acc)) + { // ternary op : beta * C + (alpha * acc) + {{beta}, {beta_ptr}}, // leaf args : beta + {}, // leaf args : C + { // binary op : alpha * acc + {{alpha}, {alpha_ptr}}, // leaf args : alpha + {}, // leaf args : acc + {} // binary args : multiplies + }, // end binary op + {} // ternary args : multiply_add + }, // 
end ternary op + {} // unary args: activation + }; // end unary op + } + }; + + // Ctor inheritance + using Impl::Impl; +}; + +///////////////////////////////////////////////////////////////////////////////////////////////// + namespace detail { template > struct get_element_aux { diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp index 4b0439791f..49cbb38a90 100644 --- a/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm90_visitor_compute_tma_warpspecialized.hpp @@ -78,7 +78,7 @@ using namespace detail; // the template argument. // // template -// struct FooHomogeneous : public Foo {}; +// struct FooHomogeneous : public Foo {}; // template< template class ComputeFn, @@ -181,14 +181,20 @@ struct Sm90Compute { }, [&] (auto&&... cvt_frg_inputs) { using ComputeOutput = ComputeFn>; - using ConvertOutput = NumericArrayConverter; ComputeOutput compute_output{}; - ConvertOutput convert_output{}; if constexpr (cute::is_same_v) { + using ElementComputeOutput = + typename cute::remove_cvref_t::Element; + using ConvertOutput = NumericArrayConverter; + ConvertOutput convert_output{}; return convert_output(compute_output(cvt_frg_inputs...)); } else { + using ElementComputeOutput = + typename cute::remove_cvref_t::Element; + using ConvertOutput = NumericArrayConverter; + ConvertOutput convert_output{}; return convert_output(compute_output(cvt_frg_inputs..., params)); } } @@ -257,8 +263,16 @@ struct Sm90TreeVisitor< CUTLASS_DEVICE bool is_producer_load_needed() const { + auto const& scale_op = get<0>(Impl::ops); auto const& added_op = get<2>(Impl::ops); - return is_C_load_needed() || added_op.is_producer_load_needed(); + if constexpr (detail::IsScalarBroadcast::value && not is_void_v) { + return (get<2>(scale_op.params_ptr->dScalar[0]) != 0 && scale_op.params_ptr->scalar_ptrs[0] != nullptr) || + is_C_load_needed() || + added_op.is_producer_load_needed(); + } + else { + return is_C_load_needed() || added_op.is_producer_load_needed(); + } } CUTLASS_DEVICE bool @@ -290,7 +304,7 @@ struct Sm90TreeVisitor< Array frg_I = convert_Z(frg_added); - if (is_C_load_needed) { + if constexpr (!is_void_v) { Array frg_scalar = get<0>(CallbacksImpl::callbacks_tuple).visit(frg_acc, epi_v, epi_m, epi_n); Array frg_source = get<1>(CallbacksImpl::callbacks_tuple).visit(frg_acc, epi_v, epi_m, epi_n); @@ -317,8 +331,12 @@ struct Sm90TreeVisitor< CUTLASS_DEVICE auto get_consumer_store_callbacks(ConsumerStoreArgs const& args) { auto callbacks_tuple = Impl::template get_consumer_store_callbacks(args); + bool is_C_load_needed = this->is_C_load_needed(); + if (not is_C_load_needed) { + cute::clear(args.tCrC); + } return ConsumerStoreCallbacks( - is_C_load_needed(), std::move(callbacks_tuple)); + is_C_load_needed, std::move(callbacks_tuple)); } }; @@ -491,7 +509,22 @@ struct Sm90TreeVisitor< else { frg_compute[i] = relu(frg_compute[i]); } - frg_aux[i] = frg_compute[i] == pre_relu; + if constexpr (cute::is_same_v) { + uint32_t aux; +#if defined(__SYCL_CUDA_ARCH__) || defined(__CUDA_ARCH__) + asm volatile("set.equ.u32.f32 %0, %1, %2;\n" : "=r"(aux) : "f"(frg_compute[i]), "f"(pre_relu)); // NaN outputs 1 in Aux +#endif + frg_aux[i] = static_cast(aux); + } else if constexpr (cute::is_same_v) { + uint32_t aux; + cutlass::half_t compute = frg_compute[i]; +#if defined(__SYCL_CUDA_ARCH__) || defined(__CUDA_ARCH__) + asm volatile("set.equ.u32.f16 %0, %1, %2;\n" 
: "=r"(aux) : "h"(compute.raw()), "h"(pre_relu.raw())); // NaN outputs 1 in Aux +#endif + frg_aux[i] = static_cast(aux); + } else { + frg_aux[i] = frg_compute[i] == pre_relu; + } } static_assert(FragmentSize % 8 == 0, "Predicate vector must be byte-aligned"); diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp index 4eb326b3dd..a22bed4e0d 100644 --- a/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm90_visitor_load_tma_warpspecialized.hpp @@ -37,6 +37,7 @@ #include "cutlass/cutlass.h" #include "cutlass/arch/barrier.h" +#include "cutlass/epilogue/collective/detail.hpp" #include "cute/tensor.hpp" #include "sm90_visitor_tma_warpspecialized.hpp" @@ -377,6 +378,174 @@ struct Sm90AuxLoad { } }; +template < + class Element, + class EpilogueTile, // Unused + class LayoutOrStrideMNL, + class SmemLayoutAtom, // Unused + class CopyOpS2R, // Unused + int Alignment, + bool EnableNullptr +> +struct Sm90AuxLoad< + 0, EpilogueTile, Element, LayoutOrStrideMNL, + SmemLayoutAtom, CopyOpS2R, Alignment, EnableNullptr +> { + using ElementAux = Element; + using StrideMNL = cutlass::gemm::TagToStrideC_t; + + struct SharedStorage { }; + + struct Arguments { + Element const* ptr_aux = nullptr; + Element null_default = Element(0); + StrideMNL dAux = {}; + }; + + using Params = Arguments; + + template + static constexpr Params + to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { + return args; + } + + template + static bool + can_implement(ProblemShape const& problem_shape, Arguments const& args) { + return true; + } + + template + static size_t + get_workspace_size(ProblemShape const& problem_shape, Arguments const& args) { + return 0; + } + + template + static cutlass::Status + initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream, + CudaHostAdapter* cuda_adapter = nullptr) { + return cutlass::Status::kSuccess; + } + + CUTLASS_HOST_DEVICE + Sm90AuxLoad() { } + + CUTLASS_HOST_DEVICE + Sm90AuxLoad(Params const& params, SharedStorage const& shared_storage) + : params_ptr(¶ms) { } + + Params const* params_ptr; + + CUTLASS_DEVICE bool + is_producer_load_needed() const { + return false; + } + + CUTLASS_DEVICE bool + is_C_load_needed() const { + return false; + } + + template + CUTLASS_DEVICE auto + get_producer_load_callbacks(ProducerLoadArgs const& args) { + return EmptyProducerLoadCallbacks{}; + } + + template< + class GTensorG2R, + class RTensor, + class CTensorG2R, + class ProblemShapeMNL + > + struct ConsumerStoreCallbacks : EmptyConsumerStoreCallbacks { + CUTLASS_DEVICE + ConsumerStoreCallbacks(GTensorG2R&& tC_gAux, + RTensor&& tC_rAux, + CTensorG2R&& tC_cAux, + ProblemShapeMNL problem_shape_mnl, + Params const* params_ptr) + : tC_gAux(cute::forward(tC_gAux)), + tC_rAux(cute::forward(tC_rAux)), + tC_cAux(cute::forward(tC_cAux)), + problem_shape_mnl(problem_shape_mnl), + params_ptr(params_ptr) {} + + GTensorG2R tC_gAux; + RTensor tC_rAux; + CTensorG2R tC_cAux; + ProblemShapeMNL problem_shape_mnl; + Params const* params_ptr; + + CUTLASS_DEVICE void + begin_loop(int epi_m, int epi_n) { + if constexpr (EnableNullptr) { + if (params_ptr->ptr_aux == nullptr) { + fill(tC_rAux, params_ptr->null_default); + return; + } + } + constexpr auto MCL = decltype(max_common_layout(tC_gAux(_,_,_,_0{},_0{}), tC_rAux)){}; + constexpr int V = 
cute::min(Alignment, size(MCL)); + + Tensor tC_cAux_mn = tC_cAux(_,_,_,epi_m,epi_n); + Tensor tC_cAux_vec = tensor<1>(zipped_divide(coalesce(tC_cAux_mn), MCL.compose(Int{}))); + + Tensor tC_gAux_vec = recast>(coalesce(tC_gAux(_,_,_,epi_m,epi_n))); + Tensor tC_rAux_vec = recast>(coalesce(tC_rAux)); + + auto pred_fn = [&] (auto const&... coords) { + return elem_less(tC_cAux_vec(coords...), problem_shape_mnl); + }; + + copy_if(pred_fn, tC_gAux_vec, tC_rAux_vec); + } + + template + CUTLASS_DEVICE Array + visit(Array const& frg_acc, int epi_v, int epi_m, int epi_n) { + return recast>(tC_rAux)(epi_v); + } + }; + + template < + bool ReferenceSrc, + class... Args + > + CUTLASS_DEVICE auto + get_consumer_store_callbacks(ConsumerStoreArgs const& args) { + auto [M, N, K, L] = args.problem_shape_mnkl; + auto [m, n, k, l] = args.tile_coord_mnkl; + + auto problem_shape_mnl = make_shape(M,N,L); + + // Gmem Tensor + Tensor mAux = make_tensor( + make_gmem_ptr(params_ptr->ptr_aux), make_shape(M,N,L), params_ptr->dAux + ); + Tensor tC_gAux = sm90_partition_for_epilogue( + mAux, args.tile_shape_mnk, args.tile_coord_mnkl, args.epi_tile, args.tiled_copy, args.thread_idx); + + // Register Tensor + Tensor tC_rAux = make_tensor(take<0,3>(shape(tC_gAux))); + + // Predication support + Tensor coordAux = make_identity_tensor(shape(mAux)); + Tensor tC_cAux = sm90_partition_for_epilogue( + coordAux, args.tile_shape_mnk, args.tile_coord_mnkl, args.epi_tile, args.tiled_copy, args.thread_idx); + + return ConsumerStoreCallbacks( + cute::move(tC_gAux), + cute::move(tC_rAux), + cute::move(tC_cAux), + problem_shape_mnl, + params_ptr + ); + } +}; + ///////////////////////////////////////////////////////////////////////////////////////////////// // // Broadcast Load Operations @@ -387,11 +556,12 @@ struct Sm90AuxLoad { // Supports reduction over multiple broadcasts to support fusions such as fp8 scaling factors template< class Element, - class StrideMNL = Stride<_0,_0,_0>, + class StrideMNL_ = Stride<_0,_0,_0>, int BroadcastCount = 1, template class ReductionFn = multiplies > struct Sm90ScalarBroadcast { + using StrideMNL = StrideMNL_; static_assert(is_static_v(StrideMNL{}))>); // batch stride can be dynamic or static static_assert(take<0,2>(StrideMNL{}) == Stride<_0,_0>{}); @@ -400,7 +570,7 @@ struct Sm90ScalarBroadcast { struct Arguments { Element scalars[BroadcastCount] = {}; Element const* scalar_ptrs[BroadcastCount] = {}; - StrideMNL dScalar = {}; + StrideMNL dScalar[BroadcastCount] = {}; }; using Params = Arguments; @@ -443,7 +613,21 @@ struct Sm90ScalarBroadcast { // This must be called after update_scalar is called CUTLASS_DEVICE bool is_zero() const { - return scalar == Element(0); + if (get<2>(params_ptr->dScalar[0]) == 0) { + // Only 1 batch + return scalar == Element(0); + } + else { + // multiple batch + if (valid_scalar == false) { + // for stridedBatch kernel, if ptr has a valid address, we need to enable the epi_load warps. + return params_ptr->scalar_ptrs[0] == nullptr; + } + else { + // Check whether each batch is ZERO or not. 
+ return scalar == Element(0); + } + } } CUTLASS_HOST_DEVICE @@ -453,19 +637,20 @@ struct Sm90ScalarBroadcast { Sm90ScalarBroadcast(Params const& params, SharedStorage const& shared_storage) : params_ptr(¶ms) { // Get the scalar for non-batched broadcast - if (get<2>(params_ptr->dScalar) == 0) { + if (size<2>(params_ptr->dScalar[0]) == 0) { update_scalar(); } } Element scalar; + bool valid_scalar = false; Params const* params_ptr; template CUTLASS_DEVICE auto get_producer_load_callbacks(ProducerLoadArgs const& args) { // Get the scalar for batched broadcast - if (get<2>(params_ptr->dScalar) != 0) { + if (size<2>(params_ptr->dScalar[0]) != 0) { auto [m_coord, n_coord, k_coord, l_coord] = args.tile_coord_mnkl; update_scalar(l_coord); } @@ -499,7 +684,7 @@ struct Sm90ScalarBroadcast { get_consumer_store_callbacks(ConsumerStoreArgs const& args) { // Get the scalar for batched broadcast - if (get<2>(params_ptr->dScalar) != 0) { + if (get<2>(params_ptr->dScalar[0]) != 0) { auto [m_coord, n_coord, k_coord, l_coord] = args.tile_coord_mnkl; update_scalar(l_coord); } @@ -510,11 +695,13 @@ struct Sm90ScalarBroadcast { private: CUTLASS_DEVICE void update_scalar(int l_coord = 0) { - int l_offset = l_coord * size<2>(params_ptr->dScalar); + valid_scalar = true; + int l_offset = l_coord * size<2>(params_ptr->dScalar[0]); if (params_ptr->scalar_ptrs[0] != nullptr) { scalar = params_ptr->scalar_ptrs[0][l_offset]; - } else { + } + else { // batch stride is ignored for nullptr fallback scalar = params_ptr->scalars[0]; } @@ -524,8 +711,10 @@ struct Sm90ScalarBroadcast { CUTLASS_PRAGMA_UNROLL for (int i = 1; i < BroadcastCount; ++i) { if (params_ptr->scalar_ptrs[i] != nullptr) { - scalar = reduction_fn(scalar, params_ptr->scalar_ptrs[i][l_offset]); - } else { + int rest_l_offset = l_coord * size<2>(params_ptr->dScalar[i]); + scalar = reduction_fn(scalar, params_ptr->scalar_ptrs[i][rest_l_offset]); + } + else { // batch stride is ignored for nullptr fallback scalar = reduction_fn(scalar, params_ptr->scalars[i]); } @@ -536,11 +725,175 @@ struct Sm90ScalarBroadcast { CUTLASS_DEVICE void update_scalar(cute::tuple) { // Only support multiple L-modes with fully-broadcast scalar - static_assert(cute::is_same_v>); scalar = params_ptr->scalars[0]; + valid_scalar = true; + } +}; + +// Scalar broadcast +// Supports reduction over multiple broadcasts to support fusions such as fp8 scaling factors +template< + class Element, + class StrideMNL = Stride<_0,_0,_0>, + int BroadcastCount = 1, + template class ReductionFn = multiplies +> +struct Sm90ScalarBroadcastPtrArray { + static_assert(is_static_v(StrideMNL{}))>); // batch stride can be dynamic or static + static_assert(take<0,2>(StrideMNL{}) == Stride<_0,_0>{}); + + struct SharedStorage { }; + + struct Arguments { + Element scalars[BroadcastCount] = {}; + Element const* scalar_ptrs[BroadcastCount] = {}; + Element const* const* scalar_ptr_arrays[BroadcastCount] = {}; + StrideMNL dScalar[BroadcastCount] = {}; + }; + + using Params = Arguments; + + template + static constexpr Params + to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { + return args; + } + + template + static bool + can_implement(ProblemShape const& problem_shape, Arguments const& args) { + return true; + } + + template + static size_t + get_workspace_size(ProblemShape const& problem_shape, Arguments const& args) { + return 0; + } + + template + static cutlass::Status + initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* 
workspace, cudaStream_t stream, + CudaHostAdapter *cuda_adapter = nullptr) { + return cutlass::Status::kSuccess; + } + + CUTLASS_DEVICE bool + is_producer_load_needed() const { + // producer load is needed if Element is not void and we have multiple scalars + return !cute::is_void_v and size<2>(params_ptr->dScalar[0]) != 0; + } + + CUTLASS_DEVICE bool + is_C_load_needed() const { + return false; + } + + // This must be called after update_scalar is called + CUTLASS_DEVICE bool + is_zero() const { + return scalar == Element(0); + } + + CUTLASS_HOST_DEVICE + Sm90ScalarBroadcastPtrArray() { } + + CUTLASS_HOST_DEVICE + Sm90ScalarBroadcastPtrArray(Params const& params, SharedStorage const& shared_storage) + : params_ptr(¶ms) { + // Get the scalar for non-batched broadcast + if (size<2>(params_ptr->dScalar[0]) == 0) { + update_scalar(); + } + } + + Element scalar; + Params const* params_ptr; + + template + CUTLASS_DEVICE auto + get_producer_load_callbacks(ProducerLoadArgs const& args) { + // Get the scalar for batched broadcast + if (get<2>(params_ptr->dScalar[0]) != 0) { + auto [m_coord, n_coord, k_coord, l_coord] = args.tile_coord_mnkl; + update_scalar(l_coord); + } + + return EmptyProducerLoadCallbacks{}; + } + + struct ConsumerStoreCallbacks : EmptyConsumerStoreCallbacks { + CUTLASS_DEVICE + ConsumerStoreCallbacks(Element scalar) + : scalar(scalar) {} + + Element scalar; + + template + CUTLASS_DEVICE Array + visit(Array const& frg_acc, int epi_v, int epi_m, int epi_n) { + Array frg_scalar; + frg_scalar.fill(scalar); + + return frg_scalar; + } + + }; + + template < + bool ReferenceSrc, // do register tensors reference the src or dst layout of the tiled copy + class... Args + > + CUTLASS_DEVICE auto + get_consumer_store_callbacks(ConsumerStoreArgs const& args) { + + // Get the scalar for batched broadcast + if (get<2>(params_ptr->dScalar[0]) != 0) { + auto [m_coord, n_coord, k_coord, l_coord] = args.tile_coord_mnkl; + update_scalar(l_coord); + } + + return ConsumerStoreCallbacks(scalar); + } + +private: + CUTLASS_DEVICE void + update_scalar(int l_coord = 0) { + int l_offset = l_coord * size<2>(params_ptr->dScalar[0]); + + if (params_ptr->scalar_ptr_arrays[0] != nullptr) { + scalar = *(params_ptr->scalar_ptr_arrays[0][l_offset]); + } + else if (params_ptr->scalar_ptrs[0] != nullptr) { + scalar = params_ptr->scalar_ptrs[0][l_offset]; + } + else { + // batch stride is ignored for nullptr fallback + scalar = params_ptr->scalars[0]; + } + + // Do reduction over multiple broadcasts if necessary + ReductionFn reduction_fn; + CUTLASS_PRAGMA_UNROLL + for (int i = 1; i < BroadcastCount; ++i) { + + if (params_ptr->scalar_ptr_arrays[i] != nullptr) { + int rest_l_offset = l_coord * size<2>(params_ptr->dScalar[i]); + scalar = reduction_fn(scalar, *(params_ptr->scalar_ptr_arrays[i][rest_l_offset])); + } + if (params_ptr->scalar_ptrs[i] != nullptr) { + int rest_l_offset = l_coord * size<2>(params_ptr->dScalar[i]); + scalar = reduction_fn(scalar, params_ptr->scalar_ptrs[i][rest_l_offset]); + } + else { + // batch stride is ignored for nullptr fallback + scalar = reduction_fn(scalar, params_ptr->scalars[i]); + } + } } }; + ///////////////////////////////////////////////////////////////////////////////////////////////// namespace detail { @@ -557,32 +910,40 @@ compute_row_broadcast_stages() { template< int Stages, class CtaTileShapeMNK, - class Element, - class StrideMNL = Stride<_0,_1,_0>, - int Alignment = 128 / sizeof_bits_v, + class ElementInput, + class ElementCompute = ElementInput, + class StrideMNL_ = 
Stride<_0,_1,_0>, + int Alignment = 128 / sizeof_bits_v, bool EnableNullptr = true // Fallback scalar broadcast for nullptr params > struct Sm90RowBroadcast { - static_assert(Stages == 0, "Row broadcast doesn't support smem usage"); - static_assert(is_static_v(StrideMNL{}))>); // batch stride can be dynamic or static - static_assert(take<0,2>(StrideMNL{}) == Stride<_0,_1>{}); + using StrideMNL = StrideMNL_; + static_assert(Stages == 0, "Row broadcast doesn't support smem pipelining"); + + static constexpr bool IsDynamicBroadcast = is_same_v(StrideMNL{}))>, bool>; // row vector or scalar broadcast + static_assert(is_static_v(StrideMNL{}))> || IsDynamicBroadcast); // batch stride can be dynamic or static + static_assert(take<0,2>(StrideMNL{}) == Stride<_0,_1>{} || IsDynamicBroadcast); struct SharedStorage { - array_aligned(CtaTileShapeMNK{})> smem; + array_aligned(CtaTileShapeMNK{})> smem; }; struct Arguments { - Element const* ptr_row = nullptr; - Element null_default = Element(0); + ElementInput const* ptr_row = nullptr; + ElementInput null_default = ElementInput(0); StrideMNL dRow = {}; }; - using Params = Arguments; + struct Params { + ElementInput const* ptr_row = nullptr; + ElementCompute null_default = ElementCompute(0); + StrideMNL dRow = {}; + }; template static constexpr Params to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { - return args; + return {args.ptr_row, ElementCompute(args.null_default), args.dRow}; } template @@ -609,11 +970,22 @@ struct Sm90RowBroadcast { CUTLASS_HOST_DEVICE Sm90RowBroadcast(Params const& params, SharedStorage const& shared_storage) - : params(params) - , smem(const_cast(shared_storage.smem.data())) { } + : params(params), is_zero_(false), + smem(const_cast(shared_storage.smem.data())) { + auto const& [stride_M, stride_N, stride_L] = params.dRow; + // Nullptr default + if (EnableNullptr && params.ptr_row == nullptr) { + is_zero_ = params.null_default == ElementCompute(0); + } + // Dynamic non-batched scalar broadcast + else if (IsDynamicBroadcast && stride_N == bool(0) && stride_L == repeat_like(stride_L, 0)) { + is_zero_ = params.ptr_row[0] == ElementInput(0); + } + } Params params; - Element *smem = nullptr; + bool is_zero_ = false; + ElementInput *smem = nullptr; CUTLASS_DEVICE bool is_producer_load_needed() const { @@ -627,7 +999,7 @@ struct Sm90RowBroadcast { CUTLASS_DEVICE bool is_zero() const { - return (params.ptr_row == nullptr && params.null_default == Element(0)); + return is_zero_; } template @@ -636,24 +1008,27 @@ struct Sm90RowBroadcast { return EmptyProducerLoadCallbacks{}; } - template + template struct ConsumerStoreCallbacks : EmptyConsumerStoreCallbacks { CUTLASS_DEVICE ConsumerStoreCallbacks( GS_GTensor tGS_gRow_, GS_STensor tGS_sRow_, GS_CTensor tGS_cRow_, Tiled_G2S tiled_g2s_, SR_STensor tSR_sRow_, SR_RTensor tSR_rRow_, - CTensor tCcRow_, ThrResidue residue_tCcRow_, ThrNum thr_num_, Params const& params_) + Residue residue_cRow_, ThrNum thr_num_, Params const& params_) : tGS_gRow(tGS_gRow_) , tGS_sRow(tGS_sRow_) , tGS_cRow(tGS_cRow_) , tiled_G2S(tiled_g2s_) , tSR_sRow(tSR_sRow_) , tSR_rRow(tSR_rRow_) - , tCcRow(tCcRow_) - , residue_tCcRow(residue_tCcRow_) + , residue_cRow(residue_cRow_) , params(params_) - , is_nullptr(EnableNullptr && params_.ptr_row == nullptr) {} + , is_nullptr(EnableNullptr && params_.ptr_row == nullptr) { + if (is_nullptr) { + fill(tSR_rRow, params.null_default); + } + } GS_GTensor tGS_gRow; // (CPY,CPY_M,CPY_N) GS_STensor tGS_sRow; // (CPY,CPY_M,CPY_N) @@ 
-663,35 +1038,31 @@ struct Sm90RowBroadcast { SR_STensor tSR_sRow; // (CPY,CPY_M,CPY_N,EPI_M,EPI_N) SR_RTensor tSR_rRow; // (CPY,CPY_M,CPY_N,EPI_M,EPI_N) - CTensor tCcRow; // (CPY,CPY_M,CPY_N,EPI_M,EPI_N) - ThrResidue residue_tCcRow; // (m, n) + Residue residue_cRow; // (m, n) ThrNum thr_num; Params const& params; bool is_nullptr; CUTLASS_DEVICE void begin() { - if constexpr (EnableNullptr) { - if (params.ptr_row == nullptr) { - fill(tSR_rRow, params.null_default); - return; - } + if (is_nullptr) { + return; } auto synchronize = [&] () { cutlass::arch::NamedBarrier::sync(thr_num, cutlass::arch::ReservedNamedBarriers::EpilogueBarrier); }; Tensor tGS_gRow_flt = filter_zeros(tGS_gRow); Tensor tGS_sRow_flt = filter_zeros(tGS_sRow); - Tensor tGS_cRow_flt = make_tensor(tGS_cRow.data(), make_layout(tGS_gRow_flt.shape(), tGS_cRow.stride())); + Tensor tGS_cRow_flt = filter_zeros(tGS_cRow, tGS_gRow.stride()); for (int i = 0; i < size(tGS_gRow_flt); ++i) { if (get<1>(tGS_cRow_flt(i)) >= size<1>(CtaTileShapeMNK{})) { continue; // OOB of SMEM, } - if (elem_less(tGS_cRow_flt(i), make_coord(get<0>(residue_tCcRow), get<1>(residue_tCcRow)))) { + if (elem_less(tGS_cRow_flt(i), residue_cRow)) { tGS_sRow_flt(i) = tGS_gRow_flt(i); } else { - tGS_sRow_flt(i) = Element(0); // Set to Zero when OOB so LDS could be issue without any preds. + tGS_sRow_flt(i) = ElementInput(0); // Set to Zero when OOB so LDS can be issued without any preds. } } synchronize(); @@ -699,18 +1070,28 @@ struct Sm90RowBroadcast { CUTLASS_DEVICE void begin_loop(int epi_m, int epi_n) { - if (epi_m == 0) { // Assumes M-major subtile loop - if (is_nullptr) return; // Do not issue LDS when bias is nullptr + if (epi_m == 0 and not is_nullptr) { // Assumes M-major subtile loop Tensor tSR_sRow_flt = filter_zeros(tSR_sRow(_,_,_,epi_m,epi_n)); - Tensor tSR_rRow_flt = filter_zeros(tSR_rRow); - copy(tSR_sRow_flt, tSR_rRow_flt); + Tensor tSR_rRow_flt = make_tensor_like(tSR_sRow_flt); + copy_aligned(tSR_sRow_flt, tSR_rRow_flt); + + constexpr int FrgSize = size(tSR_rRow_flt); + using FrgInput = Array; + using FrgCompute = Array; + using ConvertInput = NumericArrayConverter; + + Tensor tSR_rRow_input_frg = recast(coalesce(tSR_rRow_flt)); + Tensor tSR_rRow_compute_frg = recast(filter(tSR_rRow)); + ConvertInput convert_input{}; + + tSR_rRow_compute_frg(_0{}) = convert_input(tSR_rRow_input_frg(_0{})); } } template - CUTLASS_DEVICE Array + CUTLASS_DEVICE Array visit(Array const& frg_acc, int epi_v, int epi_m, int epi_n) { - Array frg_row; + Array frg_row; CUTLASS_PRAGMA_UNROLL for (int i = 0; i < FragmentSize; ++i) { @@ -731,12 +1112,30 @@ struct Sm90RowBroadcast { auto [m, n, k, l] = args.tile_coord_mnkl; using ThreadCount = decltype(size(args.tiled_copy)); - Tensor mRow = make_tensor(make_gmem_ptr(params.ptr_row), make_shape(M,N,L), params.dRow); + auto layout_N = [&] () { + auto shape_N = get<1>(args.problem_shape_mnkl); + if constexpr (IsDynamicBroadcast) { + auto stride_N = repeat_like(shape_N, int(0)); + if (get<1>(params.dRow) == bool(1)) { + stride_N = transform_leaf(compact_major(shape_N), + [] (auto const& stride) { return static_cast(stride); } + ); + } + return make_layout(shape_N, stride_N); + } + else { + return make_layout(shape_N); + } + }(); + + auto layout_M = make_layout(M, repeat_like(M, _0{})); + auto layout_L = make_layout(L, get<2>(params.dRow)); + Tensor mRow = make_tensor(make_gmem_ptr(params.ptr_row), make_layout(layout_M,layout_N,layout_L)); Tensor gRow = local_tile(mRow(_,_,l), take<0,2>(args.tile_shape_mnk), make_coord(m, n)); // 
(CTA_M, CTA_N) Tensor sRow = make_tensor(make_smem_ptr(smem), make_shape(size<0>(CtaTileShapeMNK{}), size<1>(CtaTileShapeMNK{})), make_shape(_0{}, _1{})); // (CTA_M, CTA_N) //// G2S: Gmem to Smem - auto tiled_g2s = make_tiled_copy(Copy_Atom{}, + auto tiled_g2s = make_tiled_copy(Copy_Atom{}, Layout< Shape<_1, ThreadCount>, Stride<_0, _1>>{}, Layout<_1>{}); @@ -745,20 +1144,18 @@ struct Sm90RowBroadcast { Tensor tGS_sRow = thr_g2s.partition_D(sRow); //// G2S: Coord - auto cRow = make_identity_tensor(make_shape(size<0>(CtaTileShapeMNK{}), size<1>(CtaTileShapeMNK{}))); - Tensor tGS_cRow = thr_g2s.partition_S(cRow); + Tensor tGS_cRow = thr_g2s.partition_S(args.cD); //// S2R: Smem to Reg Tensor tSR_sRow = sm90_partition_for_epilogue(sRow, args.epi_tile, args.tiled_copy, args.thread_idx); - Tensor tSR_rRow = make_tensor_like(take<0,3>(tSR_sRow)); // (CPY,CPY_M,CPY_N) + Tensor tSR_rRow = make_tensor_like(take<0,3>(tSR_sRow)); // (CPY,CPY_M,CPY_N) - return ConsumerStoreCallbacks( + return ConsumerStoreCallbacks( tGS_gRow, tGS_sRow, tGS_cRow, tiled_g2s, tSR_sRow, tSR_rRow, - args.tCcD, args.residue_cD, ThreadCount{}, params); @@ -771,31 +1168,39 @@ struct Sm90RowBroadcast { template< int Stages, class CtaTileShapeMNK, - class Element, - class StrideMNL = Stride<_1,_0,_0>, - int Alignment = 128 / sizeof_bits_v, + class ElementInput, + class ElementCompute = ElementInput, + class StrideMNL_ = Stride<_1,_0,_0>, + int Alignment = 128 / sizeof_bits_v, bool EnableNullptr = true // Fallback scalar broadcast for nullptr params > struct Sm90ColBroadcast { - static_assert(Stages == 0, "Column broadcast doesn't support smem usage"); - static_assert(is_static_v(StrideMNL{}))>); // batch stride can be dynamic or static - static_assert(take<0,2>(StrideMNL{}) == Stride<_1,_0>{}); + using StrideMNL = StrideMNL_; + static_assert(Stages == 0, "Column broadcast doesn't support smem pipelining"); + + static constexpr bool IsDynamicBroadcast = is_same_v(StrideMNL{}))>, bool>; // Column vector or scalar broadcast + static_assert(is_static_v(StrideMNL{}))> || IsDynamicBroadcast); // batch stride can be dynamic or static + static_assert(take<0,2>(StrideMNL{}) == Stride<_1,_0>{} || IsDynamicBroadcast); // Accumulator distributes col elements evenly amongst threads so we can just directly load from gmem struct SharedStorage { }; struct Arguments { - Element const* ptr_col = nullptr; - Element null_default = Element(0); + ElementInput const* ptr_col = nullptr; + ElementInput null_default = ElementInput(0); StrideMNL dCol = {}; }; - using Params = Arguments; + struct Params { + ElementInput const* ptr_col = nullptr; + ElementCompute null_default = ElementCompute(0); + StrideMNL dCol = {}; + }; template static constexpr Params to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { - return args; + return {args.ptr_col, ElementCompute(args.null_default), args.dCol}; } template @@ -829,7 +1234,7 @@ struct Sm90ColBroadcast { CUTLASS_DEVICE bool is_zero() const { - return (params.ptr_col == nullptr && params.null_default == Element(0)); + return is_zero_; } CUTLASS_HOST_DEVICE @@ -837,9 +1242,20 @@ struct Sm90ColBroadcast { CUTLASS_HOST_DEVICE Sm90ColBroadcast(Params const& params, SharedStorage const& shared_storage) - : params(params) { } + : params(params), is_zero_(false) { + auto const& [stride_M, stride_N, stride_L] = params.dCol; + // Nullptr default + if (EnableNullptr && params.ptr_col == nullptr) { + is_zero_ = params.null_default == ElementCompute(0); + } + // Dynamic 
non-batched scalar broadcast + else if (IsDynamicBroadcast && stride_M == bool(0) && stride_L == repeat_like(stride_L, 0)) { + is_zero_ = params.ptr_col[0] == ElementInput(0); + } + } Params params; + bool is_zero_; template CUTLASS_DEVICE auto @@ -850,12 +1266,16 @@ struct Sm90ColBroadcast { template struct ConsumerStoreCallbacks : EmptyConsumerStoreCallbacks { CUTLASS_DEVICE - ConsumerStoreCallbacks(GTensor&& tCgCol, RTensor&& tCrCol, CTensor tCcCol, ThrResidue residue_tCcCol, Params const& params) - : tCgCol(cute::forward(tCgCol)), - tCrCol(cute::forward(tCrCol)), - tCcCol(tCcCol), - residue_tCcCol(residue_tCcCol), - params(params) {} + ConsumerStoreCallbacks(GTensor tCgCol_, RTensor tCrCol_, CTensor tCcCol_, ThrResidue residue_tCcCol_, Params const& params_) + : tCgCol(tCgCol_), + tCrCol(tCrCol_), + tCcCol(tCcCol_), + residue_tCcCol(residue_tCcCol_), + params(params_) { + if (EnableNullptr && params.ptr_col == nullptr) { + fill(tCrCol, params.null_default); + } + } GTensor tCgCol; // (CPY,CPY_M,CPY_N,EPI_M,EPI_N) RTensor tCrCol; // (CPY,CPY_M,CPY_N,EPI_M,EPI_N) @@ -865,23 +1285,20 @@ struct Sm90ColBroadcast { CUTLASS_DEVICE void begin() { - if constexpr (EnableNullptr) { - if (params.ptr_col == nullptr) { - fill(tCrCol, params.null_default); - return; - } + if (EnableNullptr && params.ptr_col == nullptr) { + return; } // Filter so we don't issue redundant copies over stride-0 modes // (only works if 0-strides are in same location, which is by construction) Tensor tCgCol_flt = filter_zeros(tCgCol); - Tensor tCrCol_flt = filter_zeros(tCrCol); - Tensor tCcCol_flt = make_tensor(tCcCol.data(), make_layout(tCrCol_flt.shape(), tCcCol.stride())); + Tensor tCrCol_flt = make_tensor_like(filter_zeros(tCrCol)); + Tensor tCcCol_flt = filter_zeros(tCcCol, tCgCol.stride()); constexpr auto MCL = decltype(max_common_layout(tCgCol_flt, tCrCol_flt)){}; constexpr int V = cute::min(Alignment, size(MCL)); if constexpr (V > 1) { - using VecType = uint_bit_t>; + using VecType = uint_bit_t>; Tensor tCgCol_vec = recast(coalesce(tCgCol_flt)); Tensor tCrCol_vec = recast(coalesce(tCrCol_flt)); Tensor tCcCol_vec = tensor<1>(zipped_divide(tCcCol_flt, MCL.compose(Int{}))); @@ -892,12 +1309,23 @@ struct Sm90ColBroadcast { auto pred_fn = [&] (auto const&... 
coords) { return elem_less(tCcCol_flt(coords...), residue_tCcCol); }; copy_if(pred_fn, tCgCol_flt, tCrCol_flt); } + + constexpr int FrgSize = size(tCrCol_flt); + using FrgInput = Array; + using FrgCompute = Array; + using ConvertInput = NumericArrayConverter; + + Tensor tCrCol_input_frg = recast(coalesce(tCrCol_flt)); + Tensor tCrCol_compute_frg = recast(filter(tCrCol)); + ConvertInput convert_input{}; + + tCrCol_compute_frg(_0{}) = convert_input(tCrCol_input_frg(_0{})); } template - CUTLASS_DEVICE Array + CUTLASS_DEVICE Array visit(Array const& frg_acc, int epi_v, int epi_m, int epi_n) { - Array frg_col; + Array frg_col; Tensor tCrCol_mn = tCrCol(_,_,_,epi_m,epi_n); CUTLASS_PRAGMA_UNROLL @@ -918,13 +1346,34 @@ struct Sm90ColBroadcast { get_consumer_store_callbacks(ConsumerStoreArgs const& args) { auto [M, N, K, L] = args.problem_shape_mnkl; - Tensor mCol = make_tensor(make_gmem_ptr(params.ptr_col), make_shape(M,N,L), params.dCol); + auto layout_M = [&] () { + auto shape_M = get<0>(args.problem_shape_mnkl); + if constexpr (IsDynamicBroadcast) { + auto stride_M = repeat_like(shape_M, int(0)); + if (get<0>(params.dCol) == bool(1)) { + stride_M = transform_leaf(compact_major(shape_M), + [] (auto const& stride) { return static_cast(stride); } + ); + } + return make_layout(shape_M, stride_M); + } + else { + return make_layout(shape_M); + } + }(); + + auto layout_N = make_layout(N, repeat_like(N, _0{})); + auto layout_L = make_layout(L, get<2>(params.dCol)); + Tensor mCol = make_tensor(make_gmem_ptr(params.ptr_col), make_layout(layout_M,layout_N,layout_L)); Tensor tCgCol = sm90_partition_for_epilogue( // (CPY,CPY_M,CPY_N,EPI_M,EPI_N) mCol, args.tile_shape_mnk, args.tile_coord_mnkl, args.epi_tile, args.tiled_copy, args.thread_idx); - Tensor tCrCol = make_tensor_like(tCgCol); // (CPY,CPY_M,CPY_N,EPI_M,EPI_N) - return ConsumerStoreCallbacks( - cute::move(tCgCol), cute::move(tCrCol), args.tCcD, args.residue_tCcD, params); + Tensor mCol_static = make_tensor(make_gmem_ptr(params.ptr_col), make_layout(make_layout(M),layout_N,layout_L)); + Tensor tCgCol_static = sm90_partition_for_epilogue( // (CPY,CPY_M,CPY_N,EPI_M,EPI_N) + mCol_static, args.tile_shape_mnk, args.tile_coord_mnkl, args.epi_tile, args.tiled_copy, args.thread_idx); + Tensor tCrCol = make_tensor_like(tCgCol_static); // (CPY,CPY_M,CPY_N,EPI_M,EPI_N) + + return ConsumerStoreCallbacks(tCgCol, tCrCol, args.tCcD, args.residue_tCcD, params); } }; @@ -945,6 +1394,20 @@ template < using Sm90MatrixBroadcast = Sm90AuxLoad; +namespace detail { + +template +struct IsScalarBroadcast { + static constexpr bool value = false; +}; + +template +struct IsScalarBroadcast(typename Operation::StrideMNL{})), Stride<_0,_0>>>> { + static constexpr bool value = true; +}; + +} + ///////////////////////////////////////////////////////////////////////////////////////////////// } // namespace cutlass::epilogue::fusion diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp index a94424abf1..87cc160c60 100644 --- a/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm90_visitor_store_tma_warpspecialized.hpp @@ -286,6 +286,185 @@ struct Sm90AuxStore { } }; +template < + class Element, + class EpilogueTile, // Unused + FloatRoundStyle RoundStyle, + class LayoutOrStrideMNL, + class SmemLayoutAtom, // Unused + class CopyOpR2S, // Unused + int Alignment, + bool EnableNullptr +> +struct Sm90AuxStore< + 0, 
EpilogueTile, Element, RoundStyle, LayoutOrStrideMNL, + SmemLayoutAtom, CopyOpR2S, Alignment, EnableNullptr +> { + using ElementAux = Element; + using StrideMNL = cutlass::gemm::TagToStrideC_t; + + struct SharedStorage { }; + + struct Arguments { + Element* ptr_aux = nullptr; + StrideMNL dAux = {}; + }; + + using Params = Arguments; + + template + static constexpr Params + to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { + return args; + } + + template + static bool + can_implement(ProblemShape const& problem_shape, Arguments const& args) { + return true; + } + + template + static size_t + get_workspace_size(ProblemShape const& problem_shape, Arguments const& args) { + return 0; + } + + template + static cutlass::Status + initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream, + CudaHostAdapter* cuda_adapter = nullptr) { + return cutlass::Status::kSuccess; + } + + CUTLASS_HOST_DEVICE + Sm90AuxStore() { } + + CUTLASS_HOST_DEVICE + Sm90AuxStore(Params const& params, SharedStorage const& shared_storage) + : params_ptr(¶ms) { } + + Params const* params_ptr; + + CUTLASS_DEVICE bool + is_producer_load_needed() const { + return false; + } + + CUTLASS_DEVICE bool + is_C_load_needed() const { + return false; + } + + template + CUTLASS_DEVICE auto + get_producer_load_callbacks(ProducerLoadArgs const& args) { + return EmptyProducerLoadCallbacks{}; + } + + template< + class GTensorR2G, + class RTensor, + class CTensorR2G, + class ProblemShapeMNL + > + struct ConsumerStoreCallbacks : EmptyConsumerStoreCallbacks { + CUTLASS_DEVICE + ConsumerStoreCallbacks( + GTensorR2G&& tC_gAux, + RTensor&& tC_rAux, + CTensorR2G&& tC_cAux, + ProblemShapeMNL problem_shape_mnl, + Params const* params_ptr) + : tC_gAux(cute::forward(tC_gAux)), + tC_rAux(cute::forward(tC_rAux)), + tC_cAux(cute::forward(tC_cAux)), + problem_shape_mnl(problem_shape_mnl), + params_ptr(params_ptr) {} + + GTensorR2G tC_gAux; + RTensor tC_rAux; + CTensorR2G tC_cAux; + ProblemShapeMNL problem_shape_mnl; + Params const* params_ptr; + + template + CUTLASS_DEVICE auto + visit(Array const& frg_acc, int epi_v, int epi_m, int epi_n, + Array const& frg_input) { + using ConvertInput = NumericArrayConverter; + ConvertInput convert_input{}; + + Tensor tC_rAux_frg = recast>(coalesce(tC_rAux)); + tC_rAux_frg(epi_v) = convert_input(frg_input); + + return frg_input; + } + + CUTLASS_DEVICE void + end_loop(int epi_m, int epi_n) { + if constexpr (EnableNullptr) { + if (params_ptr->ptr_aux == nullptr) { + return; + } + } + + constexpr auto MCL = decltype(max_common_layout(tC_gAux(_,_,_,_0{},_0{}), tC_rAux)){}; + constexpr int V = cute::min(Alignment, size(MCL)); + + Tensor tC_cAux_mn = tC_cAux(_,_,_,epi_m,epi_n); + Tensor tC_cAux_vec = tensor<1>(zipped_divide(coalesce(tC_cAux_mn), MCL.compose(Int{}))); + + Tensor tC_gAux_vec = recast>(coalesce(tC_gAux(_,_,_,epi_m,epi_n))); + Tensor tC_rAux_vec = recast>(coalesce(tC_rAux)); + + auto pred_fn = [&] (auto const&... coords) { + return elem_less(tC_cAux_vec(coords...), problem_shape_mnl); + }; + + copy_if(pred_fn, tC_rAux_vec, tC_gAux_vec); + } + }; + + template < + bool ReferenceSrc, + class... 
Args + > + CUTLASS_DEVICE auto + get_consumer_store_callbacks(ConsumerStoreArgs const& args) { + + auto [M, N, K, L] = args.problem_shape_mnkl; + auto [m, n, k, l] = args.tile_coord_mnkl; + + auto problem_shape_mnl = make_shape(M,N,L); + + // Gmem Tensor + Tensor mAux = make_tensor( + make_gmem_ptr(params_ptr->ptr_aux), make_shape(M,N,L), params_ptr->dAux + ); + Tensor tC_gAux = sm90_partition_for_epilogue( + mAux, args.tile_shape_mnk, args.tile_coord_mnkl, args.epi_tile, args.tiled_copy, args.thread_idx); + + // Register Tensor + Tensor tC_rAux = make_tensor(take<0,3>(shape(tC_gAux))); + + // Predication support + Tensor coordAux = make_identity_tensor(shape(mAux)); + Tensor tC_cAux = sm90_partition_for_epilogue( + coordAux, args.tile_shape_mnk, args.tile_coord_mnkl, args.epi_tile, args.tiled_copy, args.thread_idx); + + return ConsumerStoreCallbacks( + cute::move(tC_gAux), + cute::move(tC_rAux), + cute::move(tC_cAux), + problem_shape_mnl, + params_ptr + ); + + } + +}; + ///////////////////////////////////////////////////////////////////////////////////////////////// // // Reduction Store Operations @@ -304,10 +483,8 @@ template < > struct Sm90ScalarReduction { private: - static_assert( - (cute::is_same_v>) || // scalar reduction, e.g. tensor max element - (cute::is_same_v>) || // batched scalar reduction, e.g. per-batch max element - (cute::is_same_v>)); + static_assert(is_static_v(StrideMNL{}))>); // batch stride can be dynamic or static + static_assert(take<0,2>(StrideMNL{}) == Stride<_0,_0>{}); static constexpr bool IsAtomic = is_atomic>::value; static_assert(IsAtomic, "non-atomic scalar reduction not supported yet"); @@ -344,13 +521,16 @@ struct Sm90ScalarReduction { static cutlass::Status initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream, CudaHostAdapter* cuda_adapter = nullptr) { + #if !defined(CUTLASS_SKIP_REDUCTION_INIT) if constexpr (IsAtomic) { - auto [M, N, K, L] = problem_shape; + auto problem_shape_mnkl = append<4>(problem_shape, 1); + auto [M, N, K, L] = problem_shape_mnkl; Layout mScalar_layout = make_layout(make_shape(M,N,L), args.dScalar); if (args.ptr_scalar != nullptr) { return fill_workspace(args.ptr_scalar, ElementOutput(args.reduction_identity), cosize(mScalar_layout), stream, cuda_adapter); } } + #endif return cutlass::Status::kSuccess; } @@ -480,15 +660,18 @@ template < // tensor of ElementCompute. It is the user's responsibility to reduce this to a (N, L) tensor of ElementOutput bool FinalReduction = true, // False means skip OOB predication if OOB inputs are known to be the reduction identity - bool VisitCheckOOB = true + bool VisitCheckOOB = true, + // Indicate the parameter order when calling RegReduceFn + // Seq length equals the number of RegReduceFn parameters + // No.0 represents tCrRow; No.1 and subsequent numbers sequentially represent frg_inputs in `visit` + class RegReduceSeq = cute::seq<0, 1> > struct Sm90RowReduction { private: static_assert(Stages == 0, "Smem usage not supported yet"); static_assert(Alignment * sizeof_bits_v % 128 == 0, "sub-16B alignment not supported yet"); - static_assert( - (cute::is_same_v>) || // row vector reduction, e.g. per-col sum over all batches - (cute::is_same_v>)); // batched row vector reduction, e.g. 
per-col sum per batch + static_assert(is_static_v(StrideMNL{}))>); // batch stride can be dynamic or static + static_assert(take<0,2>(StrideMNL{}) == Stride<_0,_1>{}); static constexpr bool IsAtomic = is_atomic>::value; static_assert(not (IsAtomic && not FinalReduction), "atomic reduction must be final"); @@ -518,7 +701,9 @@ struct Sm90RowReduction { reduction_buffer = nullptr; } else if constexpr (FinalReduction) { - auto [M, N, K, L] = problem_shape; + auto problem_shape_mnkl = append<4>(problem_shape, 1); + auto [M, N, K, L] = problem_shape_mnkl; + auto [tile_M, tile_N, tile_K] = CtaTileShapeMNK{}; size_t tile_counters_offset = product(ceil_div(make_shape(size<>(M), size<>(N), L), make_shape(tile_M, tile_N))) * tile_N * sizeof(ElementCompute); tile_counters_offset = round_nearest(tile_counters_offset, MinWorkspaceAlignment); @@ -553,7 +738,8 @@ struct Sm90RowReduction { } size_t workspace_size = 0; - auto [M, N, K, L] = problem_shape; + auto problem_shape_mnkl = append<4>(problem_shape, 1); + auto [M, N, K, L] = problem_shape_mnkl; auto [tile_M, tile_N, tile_K] = CtaTileShapeMNK{}; // Increment by size of reduction buffer workspace_size += product(ceil_div(make_shape(size<>(M),size<>(N),L), make_shape(tile_M, tile_N))) * tile_N * sizeof(ElementCompute); @@ -567,16 +753,19 @@ struct Sm90RowReduction { static cutlass::Status initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream, CudaHostAdapter* cuda_adapter = nullptr) { +#if !defined(CUTLASS_SKIP_REDUCTION_INIT) + auto problem_shape_mnkl = append<4>(problem_shape, 1); + auto [M, N, K, L] = problem_shape_mnkl; if constexpr (IsAtomic) { - auto [M, N, K, L] = problem_shape; Layout mRow_layout = make_layout(make_shape(size<>(M),size<>(N),size<>(L)), args.dRow); if (args.ptr_row != nullptr) { return fill_workspace(args.ptr_row, ElementOutput(args.reduction_identity), cosize(mRow_layout), stream, cuda_adapter); } return Status::kSuccess; } - else if constexpr (FinalReduction) { - auto [M, N, K, L] = problem_shape; + else +#endif + if constexpr (FinalReduction) { auto [tile_M, tile_N, tile_K] = CtaTileShapeMNK{}; size_t tile_counters_offset = product(ceil_div(make_shape(size<>(M),size<>(N),L), make_shape(tile_M, tile_N))) * tile_N * sizeof(ElementCompute); tile_counters_offset = round_nearest(tile_counters_offset, MinWorkspaceAlignment); @@ -626,14 +815,13 @@ struct Sm90RowReduction { Params const& params; bool do_final_reduction = false; - - template + template CUTLASS_DEVICE auto visit(Array const& frg_acc, int epi_v, int epi_m, int epi_n, - Array const& frg_input) { + Array const&... 
frg_inputs) { if constexpr (EnableNullptr) { if (params.ptr_row == nullptr) { - return frg_input; + return cute::get<0>(cute::make_tuple(frg_inputs...)); } } @@ -643,21 +831,50 @@ struct Sm90RowReduction { Tensor tCrRow_mn = tCrRow(_,_,_,epi_m,epi_n); Tensor tCcRow_mn = tCcRow(_,_,_,epi_m,epi_n); - using ConvertInput = NumericArrayConverter; - using ReduceInput = RegReduceFn; - ConvertInput convert_input{}; - ReduceInput reduce_input{}; + if constexpr (VisitCheckOOB) { + using ReduceInput = RegReduceFn; + ReduceInput reduce_input{}; - Array frg_I = convert_input(frg_input); - CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < FragmentSize; ++i) { - if (!VisitCheckOOB || elem_less(tCcRow_mn(epi_v * FragmentSize + i), residue_tCcRow)) { - ElementCompute& tCrRow_vmn = tCrRow_mn(epi_v * FragmentSize + i); - tCrRow_vmn = reduce_input(tCrRow_vmn, frg_I[i]); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < FragmentSize; ++i) { + if (elem_less(tCcRow_mn(epi_v * FragmentSize + i), residue_tCcRow)) { + ElementCompute& tCrRow_vmn = tCrRow_mn(epi_v * FragmentSize + i); + tCrRow_vmn = transform_apply(cute::make_tuple(frg_inputs...), + [&] (auto&& frg_input) { + return ElementCompute(frg_input[i]); + }, + [&] (auto&&... cvt_frg_inputs) { + auto frg_compute_tuple = cute::make_tuple(tCrRow_vmn, cvt_frg_inputs...); + return cute::detail::apply(frg_compute_tuple, reduce_input, RegReduceSeq{}); + }); + } } } + else { + constexpr int RegFragSize = cute::max(1, static_cast(sizeof(uint32_t) / sizeof(ElementCompute))); + using ReduceInput = RegReduceFn>; + ReduceInput reduce_input{}; + Tensor tCrRow_mn_frg = recast>(tCrRow_mn); - return frg_input; + constexpr int RegFragArraySize = FragmentSize / RegFragSize; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < RegFragArraySize; ++i) { + Array& tCrRow_vmn_frg = tCrRow_mn_frg(epi_v * RegFragArraySize + i); + tCrRow_vmn_frg = transform_apply(cute::make_tuple(frg_inputs...), + [&] (auto&& frg_input) { + using ElementInput = typename cute::remove_cvref_t::Element; + using ConvertInput = NumericArrayConverter; + using RegFragArr = Array, RegFragArraySize>; + ConvertInput convert_input{}; + return convert_input(reinterpret_cast(frg_input)[i]); + }, + [&] (auto&&... cvt_frg_inputs) { + auto frg_compute_tuple = cute::make_tuple(tCrRow_vmn_frg, cvt_frg_inputs...); + return cute::detail::apply(frg_compute_tuple, reduce_input, RegReduceSeq{}); + }); + } + } + return cute::get<0>(cute::make_tuple(frg_inputs...)); } template @@ -683,23 +900,70 @@ struct Sm90RowReduction { return; } + int lane_m = get<0>(lane_mn); + [[maybe_unused]] bool is_reduced_lane = lane_m == 0; + // // 1. Warp shuffle reduction // using FragmentShuffle = Array; + Tensor tCrRow_frg = recast(filter(tCrRow)); using ReduceShuffle = ShuffleReduceFn; ReduceShuffle reduce_shuffle{}; - Tensor tCrRow_frg = recast(filter(tCrRow)); - CUTLASS_PRAGMA_UNROLL - for (int reduction_rows = size<0>(lane_layout_MN) / 2; reduction_rows > 0; reduction_rows /= 2) { + + auto FrgSizePerLaneM = size(tCrRow_frg) / size<0>(lane_layout_MN); + constexpr bool SwapShuffle = FrgSizePerLaneM > 0; + + // + // Swap Shuffle + // + // The usual way to reduce across threads: + // use shuffle so that *** the first half of the threads *** receives *** all of the data *** from the second half of the threads. + // After each reduction step, half of the remaining threads do no work in the steps that follow. + // That is, as the reduction progresses, the efficiency of the shuffle & reduction instructions drops from 1/2 to 1/4 and eventually to 1/32 (the worst case).
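+ // (Editorial illustration of the numbers above: with 32 lanes reducing along M, the first step keeps 16 lanes doing useful work, the next keeps 8, and the final step leaves only 1 lane in 32 active.)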
+ // + // To overcome this shortcoming, for an NxN matrix to be reduced among N threads as 1xN vectors, + // we use swap & shuffle so that *** each half of the threads *** gets *** half of the data *** from the other half of the threads. + // After the reduction, each half of the threads handles an (N/2)x(N/2) sub-matrix independently in the following step. + // We can apply this recursively until the problem size is 1. + // + if constexpr (SwapShuffle) { // for an NxN matrix to be reduced among N threads as 1xN vectors + Tensor tCrRow_frg_ = logical_divide(tCrRow_frg, FrgSizePerLaneM); // (FrgSizePerLaneM, M) CUTLASS_PRAGMA_UNROLL - for (int frg_idx = 0; frg_idx < size(tCrRow_frg); ++frg_idx) { - uint64_t frg_shfl = reinterpret_cast(tCrRow_frg(frg_idx)); - frg_shfl = __shfl_down_sync(0xFFFFFFFF, frg_shfl, lane_layout_MN(reduction_rows, _0{})); - tCrRow_frg(frg_idx) = reduce_shuffle(tCrRow_frg(frg_idx), reinterpret_cast(frg_shfl)); + for (int m = size<1>(tCrRow_frg_) / 2; m > 0; m /= 2) { + CUTLASS_PRAGMA_UNROLL + for (int r = 0; r < m; ++r) { + auto frg_A = tCrRow_frg_(_,r); + auto frg_B = tCrRow_frg_(_,r + m); + CUTLASS_PRAGMA_UNROLL + for (int v = 0; v < size(frg_A); ++v) { + // Step 1: swap + if (not (lane_m & m)) { // the first half of the threads swaps fragments from the first half of the data to the second + swap(frg_A(v), frg_B(v)); + } + + // Step 2: shuffle + uint64_t frg_shfl = reinterpret_cast(frg_A(v)); + // each half of the threads gets half of the data from the other half of the threads + frg_shfl = __shfl_xor_sync(0xFFFFFFFF, frg_shfl, lane_layout_MN(m, _0{})); + + // Step 3: reduction + frg_A(v) = reduce_shuffle(frg_B(v), reinterpret_cast(frg_shfl)); + } + } + } + } + else { + CUTLASS_PRAGMA_UNROLL + for (int reduction_rows = size<0>(lane_layout_MN) / 2; reduction_rows > 0; reduction_rows /= 2) { + CUTLASS_PRAGMA_UNROLL + for (int frg_idx = 0; frg_idx < size(tCrRow_frg); ++frg_idx) { + uint64_t frg_shfl = reinterpret_cast(tCrRow_frg(frg_idx)); + frg_shfl = __shfl_down_sync(0xFFFFFFFF, frg_shfl, lane_layout_MN(reduction_rows, _0{})); + tCrRow_frg(frg_idx) = reduce_shuffle(tCrRow_frg(frg_idx), reinterpret_cast(frg_shfl)); + } } } - bool is_reduced_lane = get<0>(lane_mn) == 0; // // 2.
Atomic reduction @@ -708,6 +972,7 @@ struct Sm90RowReduction { // Filter so we don't issue redunant copies over stride-0 modes Tensor tCrRow_flt = filter_zeros(tCrRow); Tensor tCcRow_flt = make_tensor(tCcRow.data(), make_layout(tCrRow_flt.shape(), tCcRow.stride())); + auto FltFrgSizePerLaneM = size(tCrRow_flt) / size<0>(lane_layout_MN); Tensor tCgRow = sm90_partition_for_epilogue(gRow_l(_,_,l), epi_tile, tiled_copy, thread_idx); Tensor tCgRow_flt = filter_zeros(tCgRow); @@ -717,11 +982,23 @@ ConvertOutput convert_output{}; ReduceOutput reduce_output{}; - if (is_reduced_lane) { + if constexpr (SwapShuffle) { CUTLASS_PRAGMA_UNROLL - for (int i = 0; i < size(tCrRow_flt); ++i) { - if (elem_less(tCcRow_flt(i), residue_tCcRow)) { - reduce_output(&tCgRow_flt(i), convert_output(tCrRow_flt(i))); + for (int i = 0; i < FltFrgSizePerLaneM; ++i) { + int idx = lane_m * FltFrgSizePerLaneM + i; + // Only care about OOB for N mode + if (get<1>(tCcRow_flt(idx)) < get<1>(residue_tCcRow)) { + reduce_output(&tCgRow_flt(idx), convert_output(tCrRow_flt(i))); + } + } + } + else { + if (is_reduced_lane) { + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tCrRow_flt); ++i) { + if (elem_less(tCcRow_flt(i), residue_tCcRow)) { + reduce_output(&tCgRow_flt(i), convert_output(tCrRow_flt(i))); + } } } } @@ -735,10 +1012,21 @@ struct Sm90RowReduction { // Dump warp reduction to gmem workspace using ElementGmem = cute::conditional_t; Tensor tCgBuf = sm90_partition_for_epilogue(gBuf_ml(_,_,m,l), epi_tile, tiled_copy, thread_idx); - if (is_reduced_lane) { - // Filter so we don't issue redundant copies over stride-0 modes - // (only works if 0-strides are in same location, which is by construction) - copy_aligned(filter(tCrRow), recast(filter(tCgBuf))); + + if constexpr (SwapShuffle) { + Tensor tCrRow_flt = filter(tCrRow); + Tensor tCgBuf_flt = recast(filter(tCgBuf)); + auto FltFrgSizePerLaneM = size(tCrRow_flt) / size<0>(lane_layout_MN); + Tensor tCgBuf_flt_ = logical_divide(tCgBuf_flt, FltFrgSizePerLaneM); // (FltFrgSizePerLaneM, M) + Tensor tCrRow_flt_ = logical_divide(tCrRow_flt, FltFrgSizePerLaneM); // (FltFrgSizePerLaneM, M) + copy_aligned(tCrRow_flt_(_,_0{}), tCgBuf_flt_(_,lane_m)); + } + else { + if (is_reduced_lane) { + // Filter so we don't issue redundant copies over stride-0 modes + // (only works if 0-strides are in same location, which is by construction) + copy_aligned(filter(tCrRow), recast(filter(tCgBuf))); + } } sync_fn(); } @@ -755,10 +1043,21 @@ struct Sm90RowReduction { // Dump warp reduction to smem workspace Tensor tCsBuf = sm90_partition_for_epilogue(sBuf(_,_,get<0>(warp_mn)), epi_tile, tiled_copy, thread_idx); - if (is_reduced_lane) { - // Filter so we don't issue redunant copies over stride-0 modes - // (only works if 0-strides are in same location, which is by construction) - copy_aligned(filter(tCrRow), filter(tCsBuf)); + + if constexpr (SwapShuffle) { + Tensor tCrRow_flt = filter(tCrRow); + Tensor tCsBuf_flt = filter(tCsBuf); + auto FltFrgSizePerLaneM = size(tCrRow_flt) / size<0>(lane_layout_MN); + Tensor tCsBuf_flt_ = logical_divide(tCsBuf_flt, FltFrgSizePerLaneM); // (FltFrgSizePerLaneM, M) + Tensor tCrRow_flt_ = logical_divide(tCrRow_flt, FltFrgSizePerLaneM); // (FltFrgSizePerLaneM, M) + copy_aligned(tCrRow_flt_(_,_0{}), tCsBuf_flt_(_,lane_m)); + } + else { + if (is_reduced_lane) { + // Filter so we don't issue redundant copies over stride-0 modes + // (only works if 0-strides are in same location, which is by construction) + copy_aligned(filter(tCrRow),
filter(tCsBuf)); + } } sync_fn(); @@ -772,25 +1071,30 @@ struct Sm90RowReduction { Tensor sBuf_vec = recast(filter_zeros(sBuf)); constexpr int FragsPerRow = decltype(size<1>(sBuf_frg))::value; - // Do the threadblock smem reduction - CUTLASS_PRAGMA_UNROLL - for (int reduction_rows = size<0>(warp_layout_MN) / 2; reduction_rows > 1; reduction_rows /= 2) { - int FragsPerReduction = reduction_rows * FragsPerRow; - CUTLASS_PRAGMA_NO_UNROLL - for (int frg_idx = thread_idx; frg_idx < FragsPerReduction; frg_idx += size(tiled_copy)) { - FragmentSmem frg_smem = reduce_smem(sBuf_frg(frg_idx), sBuf_frg(frg_idx + FragsPerReduction)); - sBuf_vec(frg_idx) = reinterpret_cast(frg_smem); - } - sync_fn(); - } + constexpr int RowNum = decltype(size<0>(warp_layout_MN))::value; + using FragmentSmemArray = Array; - // Do final smem reduction and dump to gmem workspace + // Do the threadblock smem reduction using VectorGmem = cute::conditional_t; Tensor gBuf_vec = recast(filter(gBuf_ml(_,_,m,l))); - CUTLASS_PRAGMA_NO_UNROLL + CUTLASS_PRAGMA_UNROLL for (int frg_idx = thread_idx; frg_idx < FragsPerRow; frg_idx += size(tiled_copy)) { - FragmentSmem frg_smem = reduce_smem(sBuf_frg(frg_idx), sBuf_frg(frg_idx + FragsPerRow)); - gBuf_vec(frg_idx) = reinterpret_cast(frg_smem); + FragmentSmemArray frg_smem; + + CUTLASS_PRAGMA_UNROLL + for (int reduction_rows = 0; reduction_rows < RowNum; ++reduction_rows) { + int FragsCurrRows = reduction_rows * FragsPerRow; + frg_smem[reduction_rows] = sBuf_frg(FragsCurrRows + frg_idx); + } + + CUTLASS_PRAGMA_UNROLL + for (int reduction_rows = RowNum / 2; reduction_rows > 0; reduction_rows /= 2) { + CUTLASS_PRAGMA_UNROLL + for (int row_idx = 0; row_idx < reduction_rows; ++row_idx) { + frg_smem[row_idx] = reduce_smem(frg_smem[row_idx], frg_smem[row_idx + reduction_rows]); + } + } + gBuf_vec(frg_idx) = reinterpret_cast(frg_smem[0]); } sync_fn(); } @@ -959,9 +1263,8 @@ struct Sm90ColReduction { private: static_assert(Stages == 0, "Smem usage not supported yet"); static_assert(Alignment * sizeof_bits_v % 128 == 0, "sub-16B alignment not supported yet"); - static_assert( - (cute::is_same_v>) || // col vector reduction, e.g. per-row sum over all batches - (cute::is_same_v>)); // batched col vector reduction, e.g. 
per-row sum per batch + static_assert(is_static_v(StrideMNL{}))>); // batch stride can be dynamic or static + static_assert(take<0,2>(StrideMNL{}) == Stride<_1,_0>{}); static constexpr bool IsAtomic = is_atomic>::value; static_assert(not (IsAtomic && not FinalReduction), "atomic reduction must be final"); @@ -991,7 +1294,9 @@ struct Sm90ColReduction { reduction_buffer = nullptr; } else if constexpr (FinalReduction) { - auto [M, N, K, L] = problem_shape; + auto problem_shape_mnkl = append<4>(problem_shape, 1); + auto [M, N, K, L] = problem_shape_mnkl; + auto [tile_M, tile_N, tile_K] = CtaTileShapeMNK{}; size_t tile_counters_offset = product(ceil_div(make_shape(M,N,L), make_shape(tile_M, tile_N))) * tile_M * sizeof(ElementCompute); tile_counters_offset = round_nearest(tile_counters_offset, MinWorkspaceAlignment); @@ -1026,7 +1331,8 @@ struct Sm90ColReduction { } size_t workspace_size = 0; - auto [M, N, K, L] = problem_shape; + auto problem_shape_mnkl = append<4>(problem_shape, 1); + auto [M, N, K, L] = problem_shape_mnkl; auto [tile_M, tile_N, tile_K] = CtaTileShapeMNK{}; // Increment by size of reduction buffer @@ -1042,16 +1348,19 @@ struct Sm90ColReduction { static cutlass::Status initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream, CudaHostAdapter* cuda_adapter = nullptr) { +#if !defined(CUTLASS_SKIP_REDUCTION_INIT) + auto problem_shape_mnkl = append<4>(problem_shape, 1); + auto [M, N, K, L] = problem_shape_mnkl; if constexpr (IsAtomic) { - auto [M, N, K, L] = problem_shape; Layout mCol_layout = make_layout(make_shape(size<>(M),size<>(N),size<>(L)), args.dCol); if (args.ptr_col != nullptr) { return fill_workspace(args.ptr_col, ElementOutput(args.reduction_identity), cosize(mCol_layout), stream, cuda_adapter); } return Status::kSuccess; } - else if constexpr (FinalReduction) { - auto [M, N, K, L] = problem_shape; + else +#endif + if constexpr (FinalReduction) { auto [tile_M, tile_N, tile_K] = CtaTileShapeMNK{}; size_t tile_counters_offset = product(ceil_div(make_shape(M,N,L), make_shape(tile_M, tile_N))) * tile_M * sizeof(ElementCompute); tile_counters_offset = round_nearest(tile_counters_offset, MinWorkspaceAlignment); diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp index 843640127d..4f7d99fa32 100644 --- a/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp +++ b/include/cutlass/epilogue/fusion/sm90_visitor_tma_warpspecialized.hpp @@ -170,7 +170,7 @@ struct ConsumerStoreArgs { Residue residue_cD; ThrCoordTensor tCcD; ThrResidue residue_tCcD; - ThrSrcTensor const& tCrC; + ThrSrcTensor & tCrC; int thread_idx; CUTLASS_DEVICE @@ -185,7 +185,7 @@ struct ConsumerStoreArgs { Residue residue_cD, ThrCoordTensor tCcD, ThrResidue residue_tCcD, - ThrSrcTensor const& tCrC, + ThrSrcTensor & tCrC, int thread_idx) : problem_shape_mnkl(problem_shape_mnkl), tile_shape_mnk(tile_shape_mnk), @@ -361,14 +361,12 @@ struct Sm90VisitorImpl : Sm90VisitorImplBase { // Callbacks can store non-persistent variables (e.g. tensors) or copies of persistent variables CallbacksTuple callbacks_tuple; - // Before entry of the subtile load loop. Bulk copies usually performed here. - // Upon entry the producer_acquire of the first subtile lock has completed. 
- // full_mbarrier_ptr is the corresponding barrier for the subsequent producer_commit arrival + // Before entry of the subtile load loop CUTLASS_DEVICE void - begin(uint64_t* full_mbarrier_ptr, int load_iteration, bool issue_tma_load) { + begin() { for_each(callbacks_tuple, [&] (auto& callbacks) { - callbacks.begin(full_mbarrier_ptr, load_iteration, issue_tma_load); + callbacks.begin(); } ); } diff --git a/include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp b/include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp new file mode 100644 index 0000000000..4624974432 --- /dev/null +++ b/include/cutlass/epilogue/fusion/sm90_visitor_topk_softmax.hpp @@ -0,0 +1,769 @@ +/*************************************************************************************************** + * Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! \file + \brief Visitor tree Top-K + Softmax fusion operation for sm90 TMA warp-specialized epilogue +*/ + +#pragma once + +#include "cutlass/cutlass.h" +#include "cutlass/workspace.h" + +#include "cute/tensor.hpp" +#include "sm90_visitor_tma_warpspecialized.hpp" + +///////////////////////////////////////////////////////////////////////////////////////////////// + +namespace cutlass::epilogue::fusion { + +///////////////////////////////////////////////////////////////////////////////////////////////// + +// Top-K + Softmax reduction across columns +// Performs a reduction of top-K values across N, and finally performs a softmax on them, +// and sets values not in the top-K to 0. +// +// Assumptions: +// 1. CTA_N >= N (single tile across N, the mode which is reduced) +// 2. EPI_N >= N (single epilogue tile across N, because we can reduce and revisit one +// epilogue tile at a time.) +// 3. Top-K value is either 2 or 4. 
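+//
+// Illustrative example of the fused semantics (editorial, not an additional assumption):
+// with K = 2 and a row of logits [1.0, 3.0, 2.0], the top-2 set is {3.0, 2.0} and its
+// logsumexp is 3.0 + log(1 + exp(2.0 - 3.0)). The entry 1.0 falls below the top-2 minimum
+// and is written as 0; the survivors are written as exp(x - logsumexp) and sum to 1.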
+// + +namespace detail { + +// Implementations for add to sorted list and merging sorted lists, +// with fast paths for lists of size 2 and 4 (Top-2 and Top-4). +// Generic implementations may result in greater register use and branching, +// and should be avoided. +// Fast paths for Top-2 and Top-4 are written in inline PTX directly. + +CUTLASS_DEVICE +Array top_2_reduce_scalar(Array a, float scalar) { + Array out; +#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__) + asm volatile( + "{\n" + " .reg .f32 mx;\n" + " .reg .pred p;\n" + " max.f32 mx, %3, %4;\n" + " setp.gtu.f32 p, %2, %4;\n" + " selp.f32 %1, mx, %2, p;\n" + " selp.f32 %0, %2, %4, p;\n" + "}\n" : "=f"(out[0]), "=f"(out[1]) : "f"(a[0]), "f"(a[1]), "f"(scalar)); +#endif + return out; +} + +CUTLASS_DEVICE +Array top_2_reduce(Array a, Array b) { + Array out; +#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__) + asm volatile( + "{\n" + " .reg .v2 .f32 mx;\n" + " .reg .pred p;\n" + " max.f32 mx.x, %3, %4;\n" // max(a1, b0) + " max.f32 mx.y, %2, %5;\n" // max(a0, b1) + " setp.gtu.f32 p, %2, %4;\n" // a0 > b0 + " selp.f32 %1, mx.x, mx.y, p;\n" // a0 > b0 ? max(a1, b0) : max(a0, b1) + " selp.f32 %0, %2, %4, p;\n" // a0 > b0 ? a0 : b0 + "}\n" : "=f"(out[0]), "=f"(out[1]) : + "f"(a[0]), "f"(a[1]), "f"(b[0]), "f"(b[1])); +#endif + return out; +} + +CUTLASS_DEVICE +Array top_4_reduce_scalar(Array a, float scalar) { + Array out; +#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__) + asm volatile( + "{\n" + " .reg .f32 mx;\n" // max(a3, b) + " .reg .pred p0;\n" // a0 > b + " .reg .pred p1;\n" // a1 > b + " .reg .pred p2;\n" // a2 > b + " max.f32 mx, %7, %8;\n" // max(a3, b) + " setp.gtu.f32 p0, %4, %8;\n" // a0 > b + " setp.gtu.f32 p1, %5, %8;\n" // a1 > b + " setp.gtu.f32 p2, %6, %8;\n" // a2 > b + " selp.f32 %3, mx, %6, p2;\n" // a2 > b ? max(a3, b) : a2 + " selp.f32 %2, %6, %8, p2;\n" // a1 = a2 > b ? a2 : b + " selp.f32 %2, %2, %5, p1;\n" // a1 > b ? max(a2, b) : a1 == a1 > b ? a1 : old_a1 + " selp.f32 %1, %5, %8, p1;\n" // a0 = a1 > b ? a1 : b + " selp.f32 %1, %1, %4, p0;\n" // a0 > b ? max(a1, b) : a0 == a0 > b ? a0 : old_a0 + " selp.f32 %0, %4, %8, p0;\n" // a0 = a0 > b ? 
a0 : b + "}\n" : + "=f"(out[0]), "=f"(out[1]), "=f"(out[2]), "=f"(out[3]) : + "f"(a[0]), "f"(a[1]), "f"(a[2]), "f"(a[3]), "f"(scalar)); +#endif + return out; +} + +CUTLASS_DEVICE +Array top_4_reduce(Array a, Array b) { + Array out; +#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__) + asm volatile( + "{\n" + " .reg .f32 mxa0b1;\n" // max(a0, b1) + " .reg .f32 mxa1b0;\n" // max(a1, b0) + + " .reg .f32 mxa2b0;\n" // max(a2, b0) + " .reg .f32 mxa1b1;\n" // max(a1, b1) + " .reg .f32 mxa0b2;\n" // max(a0, b2) + + " .reg .f32 mxa1b2;\n" // max(a1, b2) + " .reg .f32 mxa2b1;\n" // max(a2, b1) + " max.f32 mxa1b2, %5, %10;\n" + " max.f32 mxa2b1, %6, %9;\n" + + " .reg .f32 mxa3b0;\n" // max(a3, b0) + " .reg .f32 mxa0b3;\n" // max(a0, b3) + " max.f32 mxa3b0, %7, %8;\n" + " max.f32 mxa0b3, %4, %11;\n" + + " .reg .pred pa0b0;\n" // a0 > b0 + " .reg .pred pa1b0;\n" // a1 > b0 + " .reg .pred pa2b0;\n" // a2 > b0 + " .reg .pred pa0b1;\n" // a0 > b1 + " .reg .pred pa1b1;\n" // a1 > b1 + " .reg .pred pa0b2;\n" // a0 > b2 + " .reg .pred pb2a0;\n" // b2 > a0 + " .reg .pred pb1a0;\n" // b1 > a0 + + " setp.gtu.f32 pa0b0, %4, %8;\n" // a0 > b0 + " setp.gtu.f32 pa1b0, %5, %8;\n" // a1 > b0 + " setp.gtu.f32 pa2b0, %6, %8;\n" // a2 > b0 + " setp.gtu.f32 pa0b1, %4, %9;\n" // a0 > b1 + " setp.gtu.f32 pa1b1, %5, %9;\n" // a1 > b1 + " setp.gtu.f32 pa0b2, %4, %10;\n" // a0 > b2 + + " not.pred pb2a0, pa0b2;\n" + " not.pred pb1a0, pa0b1;\n" + + " selp.f32 mxa1b0, %5, %8, pa1b0;\n" // max(a1, b0) + " selp.f32 mxa0b1, %4, %9, pa0b1;\n" // max(a0, b1) + + " selp.f32 mxa1b1, %5, %9, pa1b1;\n" // max(a1, b1) + " selp.f32 mxa2b0, %6, %8, pa2b0;\n" // max(a2, b0) + " selp.f32 mxa0b2, %4, %10, pa0b2;\n" // max(a0, b2) + + // a0 + " selp.f32 %0, %4, %8, pa0b0;\n" // a0 = a0 > b0 ? a0 : b0 + + // a1 + " selp.f32 %1, mxa1b0, mxa0b1, pa0b0;\n" // a1 = a0 > b0 ? max(a1, b0) : max(a0, b1) + + // a2 + " mov.f32 %2, mxa1b1;\n" // a2 = max(a1, b1) ** most likely case + " selp.f32 %2, mxa2b0, %2, pa1b0;\n" // a0 > a1 > b0 + " selp.f32 %2, mxa0b2, %2, pb1a0;\n" // b0 > b1 > a0 + + // a3 + " mov.f32 %3, mxa1b2;\n" // a3 = max(a1, b2) ** one of the most likely cases + " selp.f32 %3, mxa2b1, %3, pa1b1;\n" // a3 = a1 > b1 ? max(a2, b1) ** second most likely case + " selp.f32 %3, mxa3b0, %3, pa2b0;\n" // a0 > a1 > a2 > b0 + " selp.f32 %3, mxa0b3, %3, pb2a0;\n" // b0 > b1 > b2 > a0 + "}\n" : + "=f"(out[0]), "=f"(out[1]), "=f"(out[2]), "=f"(out[3]) : + "f"(a[0]), "f"(a[1]), "f"(a[2]), "f"(a[3]), + "f"(b[0]), "f"(b[1]), "f"(b[2]), "f"(b[3])); +#endif + return out; +} + +// Assumption: array elements are sorted in descending order +// (a[0] is the largest element in a[].) +template +CUTLASS_DEVICE +void add_element_to_desc_sorted_array(cutlass::Array& a, Element b) { + if constexpr (N == 2 && is_same_v) { + a = top_2_reduce_scalar(a, b); + } + else if constexpr (N == 4 && is_same_v) { + a = top_4_reduce_scalar(a, b); + } + else { + // slower generic path with branching, which can cause register spills + CUTLASS_PRAGMA_UNROLL + for (int k = 0; k < N; ++k) { + if (a[k] <= b) { + // Shift down + CUTLASS_PRAGMA_UNROLL + for (int l = N - 1; l > k; --l) { + a[l] = a[l-1]; + } + a[k] = b; + break; // insert b only once + } + } + } +} + +// Assumption: array elements are sorted in descending order +// (a[0] and b[0] are the largest elements in a[] and b[].) +template +CUTLASS_DEVICE +void merge_desc_sorted_arrays(cutlass::Array& a, const cutlass::Array& b) { + if constexpr (N == 2 && is_same_v) { + a = top_2_reduce(a, b); + } + else if constexpr (N == 4 && is_same_v) { + a = top_4_reduce(a, b); + } + else { + // slower generic path with branching, which can cause register spills + int j = 0; + CUTLASS_PRAGMA_UNROLL + for (int k = 0; k < N; ++k) { + if (a[k] <= b[j]) { + // Shift down + CUTLASS_PRAGMA_UNROLL + for (int l = N - 1; l > k; --l) { + a[l] = a[l-1]; + } + a[k] = b[j]; + ++j; + } + } + } +} + +// Assumption: array elements are sorted in descending order +// (a[0] is the largest element in a[].) +template +CUTLASS_DEVICE +Element topk_logsumexp(cutlass::Array a) { + // Do one less `exp`, because we know what its result will be. + // Assume x is a set of `x_i`s, and `m` is the maximum of that set. + // logsumexp(x) = log(sum(exp(x_i))) = m + log(sum(exp(x_i - m))) = m + log(1 + sum_{i != m} exp(x_i - m)) + // Compute m + log(1 + sum_{i != m} exp(x_i - m)) + Element sum = Element(1.0); + CUTLASS_PRAGMA_UNROLL + for (int i = 1; i < N; ++i) { + sum += fast_exp(a[i] - a[0]); + } + return a[0] + fast_log(sum); +} + +CUTLASS_DEVICE +float fast_masked_softmax(float value, float minimum, float logsumexp) { + float new_value; +#if defined(__CUDA_ARCH__) || defined(__SYCL_CUDA_ARCH__) + asm volatile( + "{\n" + " .reg .pred p0;\n" + // value >= minimum + " setp.geu.f32 p0, %1, %2;\n" + + " .reg .f32 x_lse;\n" + " .reg .f32 %%f<11>;\n" + " .reg .b32 %%r<3>;\n" + + // x_lse = value - logsumexp + " sub.rn.f32 x_lse, %1, %3;\n" + + // exp(x_lse) + // The following is derived from a ptx dump of expf. + // exp requires a base conversion from exp2. + " fma.rn.f32 %%f1, x_lse, 0f3BBB989D, 0f3F000000;\n" + " cvt.sat.f32.f32 %%f2, %%f1;\n" + " fma.rm.f32 %%f3, %%f2, 0f437C0000, 0f4B400001;\n" + " add.f32 %%f4, %%f3, 0fCB40007F;\n" + " neg.f32 %%f5, %%f4;\n" + " fma.rn.f32 %%f6, x_lse, 0f3FB8AA3B, %%f5;\n" + " fma.rn.f32 %%f7, x_lse, 0f32A57060, %%f6;\n" + " mov.b32 %%r1, %%f3;\n" + " shl.b32 %%r2, %%r1, 23;\n" + " mov.b32 %%f8, %%r2;\n" + " ex2.approx.ftz.f32 %%f9, %%f7;\n" + " mul.f32 %%f10, %%f9, %%f8;\n" + + // Mask or softmax + " selp.f32 %0, %%f10, 0f00000000, p0;\n" + "}\n" : "=f"(new_value) : "f"(value), "f"(minimum), "f"(logsumexp)); +#endif + return new_value; +} + +template +CUTLASS_DEVICE +Element masked_softmax(Element value, Element minimum, Element logsumexp) { + if constexpr (is_same_v) { + // Inline PTX implementation + // Significantly reduces register requirements + return fast_masked_softmax(value, minimum, logsumexp); + } + else { + return value < minimum ? Element(0.0) : fast_exp(value - logsumexp); + } +} + +} // namespace detail + +template < + int TopK, + int FragmentSize, + class CtaTileShapeMNK, + class EpilogueTile, + class ElementOutput, + class ElementCompute, + FloatRoundStyle RoundStyle, + int Alignment = 128 / sizeof_bits_v, + bool UseButterflyReduce = true +> +struct Sm90TopKSoftmaxColReduction { +private: + static_assert(is_same_v, "Fused Top-K + Softmax reduction requires FP32 accumulation."); + static_assert(TopK == 2 || TopK == 4, "Fused Top-K + Softmax reduction only supports K=2 and K=4."); + static_assert(Alignment * sizeof_bits_v % 128 == 0, "sub-16B alignment not supported yet"); + + // Reduction tensors + // We have two tensors for this EVT node: a reduction tensor and a tensor holding + final reduction values (tCrSoftmax).
The reason for this is that Top-K and Softmax + // require different reductions, but those luckily overlap. Top-K obviously needs at least + // two values (K >= 2), and softmax needs one value: logsumexp. Logsumexp is simply the log + // of sum of exponents over the set, and is equivalent to m + sum(exp(x_i - m)), where m is the + // maximum of all x_i elements. Since safe softmax for any element x_i is computed as + // softmax(x_i) = exp(x_i - m) / sum_j(exp(x_j - max)) + // we can track logsumexp instead of tracking two variables (sum of exps and the max). + // In addition, subtracting logsumexp from any element and taking its exp is equivalent to + // computing its softmax. + // + // The overlap between softmax and top-K is that we don't need to reduce logsumexp along the + // way at all, because any element not in the top-K is going to be masked out and set to 0. + // Therefore, we only reduce the top-K elements, and when done, compute their logsumexp and + // keep it, and the smallest element in the top-K for masking out non-top-K elements. + // + // This means that our final reduction result will always be 2 elements, regardless of the value + // of K: minimum of top-K, and logsumexp. + // + // For each reduction tensor, we define a new struct for readability. + + struct ReductionResult { + ElementCompute min_; + ElementCompute logsumexp_; + + CUTLASS_DEVICE + ReductionResult() { } + + CUTLASS_DEVICE + ReductionResult(ElementCompute min, ElementCompute logsumexp): + logsumexp_(logsumexp), min_(min) { } + + // Warp shuffle broadcast + CUTLASS_DEVICE + void shuffle_up_sync(uint32_t delta, int lane_id) { + static_assert(sizeof(ReductionResult) == sizeof(uint64_t)); + uint64_t r = reinterpret_cast(*this); + r = shfl_up_sync(0xFFFFFFFF, r, delta); + *this = (lane_id - static_cast(delta) >= 0) ? reinterpret_cast(r) : *this; + } + }; + + struct TopKResult { + Array top_k_; + + CUTLASS_DEVICE + TopKResult() { + top_k_.fill(-cutlass::platform::numeric_limits::infinity()); + } + + // This is where we do the "final" reduction, where we compute + // the logsumexp for softmax, keep the smallest value in top-K, + // and discard the rest. 
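+ // For example (editorial): with TopK == 2 and top_k_ = {m, x} where m >= x, this yields
+ // ReductionResult{min_ = x, logsumexp_ = m + log(1 + exp(x - m))}.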
+ CUTLASS_DEVICE + ReductionResult reduce_final() const { + return ReductionResult(top_k_[TopK - 1], topk_logsumexp(top_k_)); + } + + // Butterfly reduction + CUTLASS_DEVICE + void shuffle_xor_sync(int laneMask) { + if constexpr (TopK == 2) { + static_assert(sizeof(TopKResult) == sizeof(uint64_t)); + uint64_t top_k = reinterpret_cast(*this); + top_k = shfl_xor_sync(0xFFFFFFFF, top_k, laneMask); + auto synced_v = reinterpret_cast(top_k); + detail::merge_desc_sorted_arrays(top_k_, synced_v.top_k_); + } + else if constexpr (TopK == 4) { + static_assert(sizeof(TopKResult) == 2 * sizeof(uint64_t)); + uint64_t* top_k_ptr = reinterpret_cast(this); + uint64_t top_k_arr[2]; + top_k_arr[0] = top_k_ptr[0]; + top_k_arr[1] = top_k_ptr[1]; + top_k_arr[0] = shfl_xor_sync(0xFFFFFFFF, top_k_arr[0], laneMask); + top_k_arr[1] = shfl_xor_sync(0xFFFFFFFF, top_k_arr[1], laneMask); + auto synced_v = reinterpret_cast(top_k_arr); + detail::merge_desc_sorted_arrays(top_k_, synced_v.top_k_); + } + else { + TopKResult synced_v; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < TopK; ++i) { + synced_v.top_k_[i] = shfl_xor_sync(0xFFFFFFFF, top_k_[i], laneMask); + } + detail::merge_desc_sorted_arrays(top_k_, synced_v.top_k_); + } + } + + // Warp shuffle reduction + CUTLASS_DEVICE + void shuffle_down_sync(uint32_t delta) { + if constexpr (TopK == 2) { + static_assert(sizeof(TopKResult) == sizeof(uint64_t)); + uint64_t top_k = reinterpret_cast(*this); + top_k = shfl_down_sync(0xFFFFFFFF, top_k, delta); + auto synced_v = reinterpret_cast(top_k); + detail::merge_desc_sorted_arrays(top_k_, synced_v.top_k_); + } + else if constexpr (TopK == 4) { + static_assert(sizeof(TopKResult) == 2 * sizeof(uint64_t)); + uint64_t* top_k_ptr = reinterpret_cast(this); + uint64_t top_k_arr[2]; + top_k_arr[0] = top_k_ptr[0]; + top_k_arr[1] = top_k_ptr[1]; + top_k_arr[0] = shfl_down_sync(0xFFFFFFFF, top_k_arr[0], delta); + top_k_arr[1] = shfl_down_sync(0xFFFFFFFF, top_k_arr[1], delta); + auto synced_v = reinterpret_cast(top_k_arr); + detail::merge_desc_sorted_arrays(top_k_, synced_v.top_k_); + } + else { + TopKResult synced_v; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < TopK; ++i) { + synced_v.top_k_[i] = shfl_down_sync(0xFFFFFFFF, top_k_[i], delta); + } + detail::merge_desc_sorted_arrays(top_k_, synced_v.top_k_); + } + } + }; + +public: + struct SharedStorage { }; + + struct Arguments { }; + + struct Params { }; + + template + static constexpr Params + to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) { + return {}; + } + + template + static bool + can_implement(ProblemShape const& problem_shape, Arguments const& args) { + auto [M, N, K, L] = problem_shape; + auto [tile_M, tile_N, tile_K] = CtaTileShapeMNK{}; + // Cross CTA reduction is not possible because there is no guarantee that all CTAs run + // concurrently. + // Cross epilogue tile reduction is possible, but re-visiting and applying reduction + // to accumulators is only possible for the current epilogue tile. 
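+ // (Editorial illustration: with a 128-wide CTA tile and a 64-wide epilogue tile across N,
+ // any problem with TopK <= N <= 64 is accepted; N = 128 would span two epilogue tiles and
+ // is rejected.)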
+ auto [epi_M, epi_N] = EpilogueTile{}; + return N <= tile_N && N <= epi_N && N >= TopK; + } + + template + static size_t + get_workspace_size(ProblemShape const& problem_shape, Arguments const& args) { + return 0; + } + + template + static cutlass::Status + initialize_workspace(ProblemShape const& problem_shape, Arguments const& args, void* workspace, cudaStream_t stream, + CudaHostAdapter* cuda_adapter = nullptr) { + return Status::kSuccess; + } + + CUTLASS_DEVICE bool + is_producer_load_needed() const { + return false; + } + + CUTLASS_DEVICE bool + is_C_load_needed() const { + return false; + } + + CUTLASS_HOST_DEVICE + Sm90TopKSoftmaxColReduction() { } + + CUTLASS_HOST_DEVICE + Sm90TopKSoftmaxColReduction(Params const& params, SharedStorage const& shared_storage) + : params(params) { } + + Params params; + + template + CUTLASS_DEVICE auto + get_producer_load_callbacks(ProducerLoadArgs const& args) { + return EmptyProducerLoadCallbacks{}; + } + + template + struct ConsumerStoreCallbacks : EmptyConsumerStoreCallbacks { + CUTLASS_DEVICE + ConsumerStoreCallbacks(ArgsTuple&& args_tuple, Params const& params) + : args_tuple(cute::forward(args_tuple)), + params(params) {} + + ArgsTuple args_tuple; + Params const& params; + + template + CUTLASS_DEVICE auto + visit(Array const& frg_acc, int epi_v, int epi_m, int epi_n, + Array const& frg_input) { + + auto& [tCrTopK, tCrSoftmax, tCcCol, cCol, + lane_layout_MN, lane_mn, + residue_cCol, residue_tCcCol] = args_tuple; + Tensor tCcCol_mn = tCcCol(_,_,_,epi_m,epi_n); + + using ConvertInput = NumericArrayConverter; + ConvertInput convert_input{}; + + Array frg_I = convert_input(frg_input); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < FragmentSize; ++i) { + auto thread_crd = tCcCol_mn(epi_v * FragmentSize + i); + if (elem_less(thread_crd, residue_tCcCol)) { + TopKResult& tCrCol_vmn = tCrTopK(epi_v * FragmentSize + i); + detail::add_element_to_desc_sorted_array(tCrCol_vmn.top_k_, frg_I[i]); + } + } + + return frg_input; + } + + template + CUTLASS_DEVICE void + reduce(STensor&& smem_buffer, SyncFn const& sync_fn, int epi_m, int epi_n, bool is_last_iteration, VTensor visit_results) { + + auto& [tCrTopK, tCrSoftmax, tCcCol, cCol, + lane_layout_MN, lane_mn, + residue_cCol, residue_tCcCol] = args_tuple; + + // fully OOB CTA in partially OOB cluster + if (not elem_less(cCol(_0{},_0{}), residue_cCol)) { + return; + } + Tensor tCcCol_mn = tCcCol(_,_,_,epi_m,epi_n); + + // `tCrTopK` and `tCrSoftmax` have 0-strides along modes that correspond to N, + // in order to reduce along modes in the `R2S` sublayout that correspond to N. + // This means we should modify and warp-reduce them according to their co-domain instead of + // their domain. Therefore we keep a filtered view of both and use them as necessary. + auto tCrTopK_f = filter(tCrTopK); + auto tCrSoftmax_f = filter(tCrSoftmax); + + // The pattern here is: reduce Top-K first, then compute logsumexp, keep it and the + // last element of Top-K, use the latter to mask the visited results, and the former + // to apply softmax. + // + // This gives us two options: reduce the Top-K with warp shuffles, have the reduced + // lanes compute logsumexp and pair it with the last Top-K element, and broadcast + // the result back using warp shuffles. + // + // Alternatively, we can do a butterfly reduction over Top-K, and have all lanes + // compute their own logsumexp and skip the broadcast. + if constexpr (UseButterflyReduce) { + // + // 1. 
Butterfly reduction + // + CUTLASS_PRAGMA_UNROLL + for (int j = 1; j < size<1>(lane_layout_MN); j *= 2) { + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tCrTopK_f); ++i) { + tCrTopK_f(i).shuffle_xor_sync(j); + } + } + + // + // 2. Strip down reduced value and compute sum of exps + // + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tCrSoftmax_f); ++i) { + tCrSoftmax_f(i) = tCrTopK_f(i).reduce_final(); + } + } + else { + // + // 1. Warp shuffle reduction + // + CUTLASS_PRAGMA_UNROLL + for (int reduction_cols = size<1>(lane_layout_MN) / 2; reduction_cols > 0; reduction_cols /= 2) { + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tCrTopK_f); ++i) { + tCrTopK_f(i).shuffle_down_sync(lane_layout_MN(_0{},reduction_cols)); + } + } + + // + // 2. Strip down reduced value and compute sum of exps + // + bool is_reduced_lane = get<1>(lane_mn) == 0; + if (is_reduced_lane) { + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tCrSoftmax_f); ++i) { + tCrSoftmax_f(i) = tCrTopK_f(i).reduce_final(); + } + } + + // + // 3. Broadcast reduced values to all participants + // + CUTLASS_PRAGMA_UNROLL + for (int broadcast_cols = 1; broadcast_cols <= size<1>(lane_layout_MN) / 2; broadcast_cols *= 2) { + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < size(tCrSoftmax_f); ++i) { + tCrSoftmax_f(i).shuffle_up_sync(lane_layout_MN(_0{},broadcast_cols), get<1>(lane_mn)); + } + } + } + + // + // 4. Re-visit and apply top-K and softmax + // + CUTLASS_PRAGMA_UNROLL + for (int epi_v = 0; epi_v < size(visit_results); ++epi_v) { + auto& visit_frag = visit_results(epi_v); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < FragmentSize; ++i) { + visit_frag[i] = detail::masked_softmax( + visit_frag[i], + tCrSoftmax(epi_v * FragmentSize + i).min_, + tCrSoftmax(epi_v * FragmentSize + i).logsumexp_ + ); + } + } + + } + + CUTLASS_DEVICE void + end_loop(int epi_m, int epi_n) { + auto& [tCrTopK, tCrSoftmax, tCcCol, cCol, + lane_layout_MN, lane_mn, + residue_cCol, residue_tCcCol] = args_tuple; + + // Reset reduced top-K values for next tile + // This must be done because we only assume a single epilogue tile across N, + // but not M. + fill(tCrTopK, TopKResult()); + } + + CUTLASS_DEVICE void + end() { } + + }; + + template < + bool ReferenceSrc, // do register tensors reference the src or dst layout of the tiled copy + class... 
Args + > + CUTLASS_DEVICE auto + get_consumer_store_callbacks(ConsumerStoreArgs const& args) { + Layout ref_layout_MN = [&] () { + if constexpr (ReferenceSrc) { return get<0>(args.tiled_copy.get_layoutS_MN()); } + else { return get<0>(args.tiled_copy.get_layoutD_MN()); } + }(); // tile_mn -> tv_idx + + // Get the MN layout + coord of lanes to determine shuffle reduction iterations + using _W = Int; + Layout tv2lane = Layout,_W,_1>,Stride<_1,_0,_0>>{}; // tv_idx -> lane_idx + Layout ref2lane = composition(tv2lane, ref_layout_MN); // tile_mn -> lane_idx + Layout lane_layout_MN = make_layout(filter(get<0>(ref2lane)), filter(get<1>(ref2lane))); // lane_mn -> lane_idx + Layout inv_lane_layout_MN = right_inverse(lane_layout_MN); // lane_idx -> lane_mn + int lane_idx = canonical_lane_idx(); + auto lane_mn = idx2crd(inv_lane_layout_MN(lane_idx), shape(lane_layout_MN)); + + // Get the MN layout + coord of warps to determine smem reduction iterations + Layout tv2warp = Layout,_W,_1>,Stride<_0,_1,_0>>{}; // tv_idx -> warp_idx + Layout ref2warp = composition(tv2warp, ref_layout_MN); // tile_mn -> warp_idx + Layout warp_layout_MN = make_layout(filter(get<0>(ref2warp)), filter(get<1>(ref2warp))); // warp_mn -> warp_idx + + // Make sure there's only one warp across N so we can use warp shuffle intrinsics for reduction. + static_assert(decltype(size<1>(warp_layout_MN))::value <= 1); + + // Reduction layout + // We're assuming all elements in a row (over which we're performing the reduction) are + // visited in the same corresponding epilogue tile, and this is what allows us to apply the + // top-K + softmax operation within `reduce()`, by re-visiting the accumulated results. + // + // This presents a challenge, because the layout of the accumulated results is typically + // in the register-to-shared-memory shape, or: (R2S,R2S_M,R2S_N). + // This means that we still need to reduce this tensor along N. + // + // The solution is simple: we need to flatten the layout, identify modes that correspond to + // N and set their strides to 0, in order to map fragment indices corresponding to the same + // row back to the same element in the tensor. + // + // This requires some extra layout manipulation, which is as follows. + + // Create new accumulator layout with column broadcast + auto [M, N, K] = args.tile_shape_mnk; + auto thr_mma = args.tiled_mma.get_thread_slice(args.thread_idx); + auto gColReduce = make_tensor( + make_layout(make_shape(M, N), make_stride(_1{}, 0_c))); // (M,N) + auto tCrColReduce = make_tensor_like( // (FrgV, MMA_M, MMA_N) + thr_mma.partition_C(gColReduce).layout()); + + // Tile the new accumulator tensor according to R2S + ThrCopy thread_r2s = args.tiled_copy.get_slice(args.thread_idx); + Tensor tRS_rSoftmax = thread_r2s.retile_S(tCrColReduce); // ((R2S,R2S_V),MMA_M,MMA_N) + auto tCrC_layout = args.tCrC.layout(); // (R2S,R2S_M,R2S_N) + + // Compose the new accumulator R2S layout with the expected tCrC layout to get final + // reduction tensor layout.
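+ // (Editorial note: because gColReduce is built with stride 0 along N, the composed layout
+ // maps every fragment index within a row to the same reduction element, which is what
+ // allows the reduction along N to happen entirely in registers.)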
+ auto tCrSoftmax_layout = take<0, 3>(tRS_rSoftmax.layout()).compose(tCrC_layout); // (R2S,R2S_V) o (R2S,R2S_M,R2S_N) + + Tensor tCrTopK = make_tensor(tCrSoftmax_layout); // (R2S,R2S_M,R2S_N) + Tensor tCrSoftmax = make_tensor(tCrSoftmax_layout); // (R2S,R2S_M,R2S_N) + fill(tCrTopK, TopKResult()); + + auto args_tuple = make_tuple( + cute::move(tCrTopK), cute::move(tCrSoftmax), args.tCcD, args.cD, + lane_layout_MN, lane_mn, + args.residue_cD, args.residue_tCcD); + return ConsumerStoreCallbacks(std::move(args_tuple), params); + } +}; + +///////////////////////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::epilogue::fusion + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/epilogue/fusion/xe_callbacks.hpp b/include/cutlass/epilogue/fusion/xe_callbacks.hpp index 76fa811725..f81d8fae99 100644 --- a/include/cutlass/epilogue/fusion/xe_callbacks.hpp +++ b/include/cutlass/epilogue/fusion/xe_callbacks.hpp @@ -82,13 +82,18 @@ struct FusionCallbacks< ElementScalar const* alpha_ptr = nullptr; ElementScalar const* beta_ptr = nullptr; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + operator typename Impl::Arguments() const { return { // ternary op : beta * C + (alpha * acc) - {{beta}, {beta_ptr}}, // leaf args : beta + {{beta}, {beta_ptr}, {dBeta}}, // leaf args : beta {}, // leaf args : C { // binary op : alpha * acc - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {} // binary args : multiplies }, // end binary op @@ -132,6 +137,11 @@ struct FusionCallbacks< ElementScalar_ const* alpha_ptr = nullptr; ElementScalar_ const* beta_ptr = nullptr; + using StrideAlpha = Stride<_0,_0,int64_t>; + using StrideBeta = Stride<_0,_0,int64_t>; + StrideAlpha dAlpha = {_0{}, _0{}, 0}; + StrideBeta dBeta = {_0{}, _0{}, 0}; + using ActivationArguments = typename Sm90Compute::Arguments; ActivationArguments activation = ActivationArguments(); @@ -139,10 +149,10 @@ struct FusionCallbacks< return { // unary op: activation(beta * C + (alpha * acc)) { // ternary op : beta * C + (alpha * acc) - {{beta}, {beta_ptr}}, // leaf args : beta + {{beta}, {beta_ptr}, {dBeta}}, // leaf args : beta {}, // leaf args : C { // binary op : alpha * acc - {{alpha}, {alpha_ptr}}, // leaf args : alpha + {{alpha}, {alpha_ptr}, {dAlpha}}, // leaf args : alpha {}, // leaf args : acc {} // binary args : multiplies }, // end binary op diff --git a/include/cutlass/epilogue/thread/activation.h b/include/cutlass/epilogue/thread/activation.h index 92407733f8..9f1cd77434 100644 --- a/include/cutlass/epilogue/thread/activation.h +++ b/include/cutlass/epilogue/thread/activation.h @@ -178,8 +178,9 @@ struct Clamp { CUTLASS_HOST_DEVICE T operator()(T const& value, T const& lower_bound, T const& upper_bound) const { - maximum mx; - minimum mn; + constexpr bool PropagateNaN = true; + maximum mx; + minimum mn; return mn(mx(value, lower_bound), upper_bound); } @@ -196,8 +197,9 @@ struct Clamp> { CUTLASS_HOST_DEVICE Array operator()(Array const& values, T const& lower_bound, T const& upper_bound) const { - maximum> mx; - minimum> mn; + constexpr bool PropagateNaN = true; + maximum, PropagateNaN> mx; + minimum, PropagateNaN> mn; return mn(mx(values, lower_bound), upper_bound); } @@ -226,7 +228,7 @@ struct LeakyReLU { CUTLASS_HOST_DEVICE T 
operator()(T const& value, Arguments const& args = Arguments()) const { - this->operator()(value, args.leaky_alpha); + return this->operator()(value, args.leaky_alpha); } }; @@ -696,6 +698,57 @@ struct dReLU_Z> { } }; +// ElementwiseFilter operator +// Filters out a specific value, mapping it to 0.0 by default +// Used in GEMM + communication fusion +template +struct ElementwiseFilter { + + static const bool kIsHeavy = false; + + struct Arguments { + T value_to_filter = T(-0.0); + T filtered_value = T(0.0); + }; + + CUTLASS_HOST_DEVICE + T operator()(T const& value, T const& value_to_filter, T const& filtered_value) const { + T res = value == value_to_filter ? filtered_value : value; + return res; + } + + CUTLASS_HOST_DEVICE + T operator()(T const& value, Arguments const& args = Arguments()) const { + return this->operator()(value, args.value_to_filter, args.filtered_value); + } +}; + +template +struct ElementwiseFilter > { + + static const bool kIsHeavy = false; + + using Arguments = typename ElementwiseFilter::Arguments; + + CUTLASS_HOST_DEVICE + Array operator()(Array const& values, T const& value_to_filter, T const& filtered_value) const { + Array y; + ElementwiseFilter filter_op; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < int(values.size()); ++i) { + y[i] = filter_op(values[i], value_to_filter, filtered_value); + } + + return y; + } + + CUTLASS_HOST_DEVICE + Array operator()(Array const& values, Arguments const& args = Arguments()) const { + return this->operator()(values, args.value_to_filter, args.filtered_value); + } +}; + ///////////////////////////////////////////////////////////////////////////////////////////////// } // namespace thread diff --git a/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h b/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h index 7456ae8df4..c5ffdaa03f 100644 --- a/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h +++ b/include/cutlass/epilogue/thread/linear_combination_bias_elementwise.h @@ -127,15 +127,20 @@ class LinearCombinationBiasElementwise { public: using ElementOutput = ElementC_; + using ElementD = ElementOutput; using ElementC = ElementC_; using ElementAccumulator = ElementAccumulator_; using ElementCompute = ElementCompute_; + using ElementScalar = ElementCompute; using ElementZ = ElementZ_; using ElementT = ElementT_; using ElementVector = ElementVector_; static int const kElementsPerAccess = ElementsPerAccess; static int const kCount = kElementsPerAccess; + /// Follow cutlass3x EVT aliases + static bool const IsEltActSupported = true; + using ElementwiseOp = ElementwiseOp_; using BinaryOp = BinaryOp_; @@ -157,7 +162,7 @@ class LinearCombinationBiasElementwise { using FragmentOutput = FragmentZ; using ElementBias = ElementVector; using FragmentBias = Array; - using ActivationFunctor = ElementwiseOp; + using ActivationFn = ElementwiseOp; static const ScaleType::Kind kScale = ScaleType::Default; static bool const kIsHeavy = kIsHeavy_member_or_false::value; @@ -396,6 +401,118 @@ class LinearCombinationBiasElementwise { frag_T = convert_t(result_T); } } + + /// Applies the operation when elementwise_op requires arguments and is_source_needed() is true + template + CUTLASS_HOST_DEVICE + void operator()( + ElementZ &Z, + ElementT &T, + ElementAccumulator const &AB, + ElementC const &C, + ElementCompute const &V, + ElementwiseArgs const &elementwise_args) const { + + ElementwiseOp elementwise_op; + BinaryOp binary_op; + + ElementCompute tmp_Accum = NumericConverter()(AB); + ElementCompute tmp_C = NumericConverter()(C); + + ElementCompute z = binary_op(alpha_ * tmp_Accum + beta_ * tmp_C, V); + ElementCompute result_Z = skip_elementwise_ ? z : elementwise_op(z, elementwise_args); + + NumericConverter convert_z; + Z = convert_z(result_Z); + + if constexpr (kStoreT) { + ElementCompute result_T = z; + NumericConverter convert_t; + T = convert_t(result_T); + } + } + + /// Applies the operation when elementwise_op requires arguments and is_source_needed() is false + template + CUTLASS_HOST_DEVICE + void operator()( + ElementZ &Z, + ElementT &T, + ElementAccumulator const &AB, + ElementCompute const &V, + ElementwiseArgs const &elementwise_args) const { + + ElementwiseOp elementwise_op; + BinaryOp binary_op; + + ElementCompute tmp_Accum = NumericConverter()(AB); + + ElementCompute z = binary_op(alpha_ * tmp_Accum, V); + ElementCompute result_Z = skip_elementwise_ ? z : elementwise_op(z, elementwise_args); + + NumericConverter convert_z; + Z = convert_z(result_Z); + + if constexpr (kStoreT) { + ElementCompute result_T = z; + NumericConverter convert_t; + T = convert_t(result_T); + } + } + + /// Applies the operation when is_source_needed() is true + CUTLASS_HOST_DEVICE + void operator()( + ElementZ &Z, + ElementT &T, + ElementAccumulator const &AB, + ElementC const &C, + ElementCompute const &V) const { + + ElementwiseOpDispatcher elementwise_op(elementwise_); + BinaryOp binary_op; + + ElementCompute tmp_Accum = NumericConverter()(AB); + ElementCompute tmp_C = NumericConverter()(C); + + ElementCompute z = binary_op(alpha_ * tmp_Accum + beta_ * tmp_C, V); + ElementCompute result_Z = skip_elementwise_ ? z : elementwise_op(z); + + NumericConverter convert_z; + Z = convert_z(result_Z); + + if constexpr (kStoreT) { + ElementCompute result_T = z; + NumericConverter convert_t; + T = convert_t(result_T); + } + } + + /// Applies the operation when is_source_needed() is false + CUTLASS_HOST_DEVICE + void operator()( + ElementZ &Z, + ElementT &T, + ElementAccumulator const &AB, + ElementCompute const &V) const { + + ElementwiseOpDispatcher elementwise_op(elementwise_); + BinaryOp binary_op; + + ElementCompute tmp_Accum = NumericConverter()(AB); + + ElementCompute z = binary_op(alpha_ * tmp_Accum, V); + ElementCompute result_Z = skip_elementwise_ ? z : elementwise_op(z); + + NumericConverter convert_z; + Z = convert_z(result_Z); + + if constexpr (kStoreT) { + ElementCompute result_T = z; + NumericConverter convert_t; + T = convert_t(result_T); + } + } }; ///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op.h b/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op.h index 1692cc3093..1d62f4fc35 100644 --- a/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op.h +++ b/include/cutlass/epilogue/threadblock/default_epilogue_tensor_op.h @@ -225,6 +225,44 @@ struct DefaultIteratorsTensorOp< static int const kFragmentsPerIteration = 2; }; +/// Partial specialization for bfloat16_t <= int32_t x 8 epilogues avoids shared memory bank conflicts.
+template < + typename ThreadblockShape, + typename WarpShape, + typename InstructionShape, + typename ThreadMap +> +struct DefaultIteratorsTensorOp< + bfloat16_t, + int32_t, + 8, + ThreadblockShape, + WarpShape, + InstructionShape, + ThreadMap> { + + using WarpTileIterator = cutlass::epilogue::warp::TileIteratorTensorOpMixed< + WarpShape, + InstructionShape, + int32_t, + 32, + 16, + 8, + 8 + >; + + using SharedLoadIterator = cutlass::epilogue::threadblock::SharedLoadIteratorMixed< + ThreadMap, + int32_t, + 32, + 16, + 8, + 8 + >; + + static int const kFragmentsPerIteration = 2; +}; + /// Partial specialization for half <= int32_t x 8 epilogues avoids shared memory bank conflicts. template < typename ThreadblockShape, diff --git a/include/cutlass/epilogue/threadblock/default_thread_map_simt.h b/include/cutlass/epilogue/threadblock/default_thread_map_simt.h index a5559f67a4..2092caf4d7 100644 --- a/include/cutlass/epilogue/threadblock/default_thread_map_simt.h +++ b/include/cutlass/epilogue/threadblock/default_thread_map_simt.h @@ -35,7 +35,7 @@ #pragma once -#include "predicated_tile_iterator.h" +#include "cutlass/epilogue/threadblock/predicated_tile_iterator.h" #include "cutlass/gemm/gemm.h" ///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/epilogue/threadblock/default_thread_map_tensor_op.h b/include/cutlass/epilogue/threadblock/default_thread_map_tensor_op.h index e4198dc42d..e39ca9d539 100644 --- a/include/cutlass/epilogue/threadblock/default_thread_map_tensor_op.h +++ b/include/cutlass/epilogue/threadblock/default_thread_map_tensor_op.h @@ -35,7 +35,7 @@ #pragma once -#include "predicated_tile_iterator.h" +#include "cutlass/epilogue/threadblock/predicated_tile_iterator.h" #include "cutlass/gemm/gemm.h" #include "cutlass/layout/pitch_linear.h" diff --git a/include/cutlass/epilogue/threadblock/default_thread_map_volta_tensor_op.h b/include/cutlass/epilogue/threadblock/default_thread_map_volta_tensor_op.h index f0ccd74e65..1eac4a1834 100644 --- a/include/cutlass/epilogue/threadblock/default_thread_map_volta_tensor_op.h +++ b/include/cutlass/epilogue/threadblock/default_thread_map_volta_tensor_op.h @@ -35,7 +35,7 @@ #pragma once -#include "predicated_tile_iterator.h" +#include "cutlass/epilogue/threadblock/predicated_tile_iterator.h" #include "cutlass/gemm/gemm.h" ///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/epilogue/threadblock/default_thread_map_wmma_tensor_op.h b/include/cutlass/epilogue/threadblock/default_thread_map_wmma_tensor_op.h index 9f01d1fffc..0dccf6525a 100644 --- a/include/cutlass/epilogue/threadblock/default_thread_map_wmma_tensor_op.h +++ b/include/cutlass/epilogue/threadblock/default_thread_map_wmma_tensor_op.h @@ -35,7 +35,7 @@ #pragma once -#include "predicated_tile_iterator.h" +#include "cutlass/epilogue/threadblock/predicated_tile_iterator.h" #include "cutlass/gemm/gemm.h" #include "cutlass/layout/pitch_linear.h" diff --git a/include/cutlass/epilogue/threadblock/epilogue.h b/include/cutlass/epilogue/threadblock/epilogue.h index 64cb2f8bd7..694734446f 100644 --- a/include/cutlass/epilogue/threadblock/epilogue.h +++ b/include/cutlass/epilogue/threadblock/epilogue.h @@ -512,24 +512,24 @@ class Epilogue : shared_load_iterator_.add_pointer_offset(kSmemPointerOffset); shared_load_iterator_.load(aligned_accum_fragment[i]); aligned_accum_fragment[0] = add_fragments(aligned_accum_fragment[0], aligned_accum_fragment[i]); 
-      }
-
-      shared_load_iterator_.add_pointer_offset((1 - kPartitionsK) * kSmemPointerOffset);
     }
-    //
-    // Compute the output result
-    //
+      shared_load_iterator_.add_pointer_offset((1 - kPartitionsK) * kSmemPointerOffset);
+    }
-    typename OutputTileIterator::Fragment output_fragment;
-    source.apply_output_operator(output_fragment, output_op, aligned_accum_fragment[0]);
+    //
+    // Compute the output result
+    //
-    //
-    // Store the final result
-    //
+    typename OutputTileIterator::Fragment output_fragment;
+    source.apply_output_operator(output_fragment, output_op, aligned_accum_fragment[0]);
+
+    //
+    // Store the final result
+    //
-    destination_iterator.store(output_fragment);
-    ++destination_iterator;
+    destination_iterator.store(output_fragment);
+    ++destination_iterator;
   }
 }
 };
diff --git a/include/cutlass/float8.h b/include/cutlass/float8.h
index f710683849..cc37aadab1 100644
--- a/include/cutlass/float8.h
+++ b/include/cutlass/float8.h
@@ -582,6 +582,12 @@ struct alignas(1) float_e4m3_t : float8_base<4, 3> {
   int mantissa() const {
     return int(storage & Base::FP8_MANTISSA_MASK);
   }
+
+  CUTLASS_HOST_DEVICE
+  friend bool isnan(float_e4m3_t const& x) {
+    return (x.storage & uint8_t(0x7f)) == uint8_t(0x7f);
+  }
+
 };
 
 ///////////////////////////////////////////////////////////////
 ///
@@ -795,6 +801,12 @@ struct alignas(1) float_e5m2_t : float8_base<5, 2> {
   int mantissa() const {
     return int(storage & Base::FP8_MANTISSA_MASK);
   }
+
+  CUTLASS_HOST_DEVICE
+  friend bool isnan(float_e5m2_t const& x) {
+    return (x.storage & uint8_t(0x7f)) > uint8_t(0x7c);
+  }
+
 };
 
 ///////////////////////////////////////////////////////////////////////////////////////////////////
 //
@@ -1262,22 +1274,22 @@ struct numeric_limits :
 //
 CUTLASS_HOST_DEVICE
-cutlass::float_e4m3_t operator "" _fe4m3(long double x) {
+cutlass::float_e4m3_t operator ""_fe4m3(long double x) {
   return cutlass::float_e4m3_t(float(x));
 }
 
 CUTLASS_HOST_DEVICE
-cutlass::float_e4m3_t operator "" _fe4m3(unsigned long long int x) {
+cutlass::float_e4m3_t operator ""_fe4m3(unsigned long long int x) {
   return cutlass::float_e4m3_t(int(x));
 }
 
 CUTLASS_HOST_DEVICE
-cutlass::float_e5m2_t operator "" _fe5m2(long double x) {
+cutlass::float_e5m2_t operator ""_fe5m2(long double x) {
   return cutlass::float_e5m2_t(float(x));
 }
 
 CUTLASS_HOST_DEVICE
-cutlass::float_e5m2_t operator "" _fe5m2(unsigned long long int x) {
+cutlass::float_e5m2_t operator ""_fe5m2(unsigned long long int x) {
   return cutlass::float_e5m2_t(int(x));
 }
diff --git a/include/cutlass/functional.h b/include/cutlass/functional.h
index 57835f8d35..c6e13c2b79 100644
--- a/include/cutlass/functional.h
+++ b/include/cutlass/functional.h
@@ -38,7 +38,6 @@
 #include "cutlass/cutlass.h"
 #include "cutlass/numeric_types.h"
 #include "cutlass/platform/platform.h"
-
 #if defined(__CUDACC_RTC__)
 #include "cutlass/floating_point_nvrtc.h"
 #endif
@@ -236,7 +235,7 @@ template <>
 struct inverse_square_root<half_t> {
   CUTLASS_HOST_DEVICE
   half_t operator()(half_t const &lhs) const {
-#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 520
+#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ > 520)
     auto result = hrsqrt(reinterpret_cast<__half const &>(lhs));
     return reinterpret_cast<half_t const &>(result);
 #else
@@ -352,7 +351,19 @@ template <typename T, bool PropagateNaN>
 struct maximum {
   CUTLASS_HOST_DEVICE
   T operator()(T const &lhs, T const &rhs) const {
-    return (lhs < rhs ? rhs : lhs);
+    if constexpr (PropagateNaN && cutlass::platform::is_floating_point<T>::value) {
+      using CUTLASS_CMATH_NAMESPACE :: isnan;
+
+      // Call isnan unqualified, so argument-dependent lookup (ADL)
+      // will find overloads such as cutlass::isnan(half_t).
+ // Calling ::isnan or std::isnan directly would force + // implicit conversions to float of custom number types + // in the cutlass namespace (e.g., cutlass::half_t). + return lhs > rhs || isnan(lhs) ? lhs : rhs; + } + else { + return (lhs < rhs ? rhs : lhs); + } } }; @@ -365,20 +376,6 @@ template struct maximum_with_default_nan_propagation : public maximum {}; -// Maximum with nan propagation -// To propagate NANs, the "max" of a two element that contains NaNs should also return a NaN -template -struct maximum { - CUTLASS_HOST_DEVICE - T operator()(T const &lhs, T const &rhs) const { -#if defined(__CUDA_ARCH__) - return lhs > rhs or ::isnan(lhs) ? lhs : rhs; -#else - return lhs > rhs or std::isnan(lhs) ? lhs : rhs; -#endif - } -}; - template <> struct maximum { CUTLASS_HOST_DEVICE @@ -390,16 +387,16 @@ struct maximum { template <> struct maximum { CUTLASS_HOST_DEVICE - float operator()(float const lhs, float const rhs) const { - float res; + float operator()(float lhs, float rhs) const { #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + float res; asm volatile("max.NaN.f32 %0, %1, %2;\n" : "=f"(res) : "f"(lhs), "f"(rhs)); -#elif defined(__CUDA_ARCH__) - res = lhs > rhs or ::isnan(lhs) ? lhs : rhs; + return res; #else - res = lhs > rhs or std::isnan(lhs) ? lhs : rhs; + using CUTLASS_CMATH_NAMESPACE :: isnan; + + return lhs > rhs || isnan(lhs) ? lhs : rhs; #endif - return res; } }; @@ -418,22 +415,17 @@ template using maximum_with_nan_propogation = maximum_with_nan_propagation; template -struct minimum{ +struct minimum { CUTLASS_HOST_DEVICE T operator()(T const &lhs, T const &rhs) const { - return (rhs < lhs ? rhs : lhs); - } -}; + if constexpr (PropagateNaN && cutlass::platform::is_floating_point::value) { + using CUTLASS_CMATH_NAMESPACE :: isnan; -template -struct minimum { - CUTLASS_HOST_DEVICE - T operator()(T const &lhs, T const &rhs) const { -#if defined(__CUDA_ARCH__) - return lhs < rhs or ::isnan(lhs) ? lhs : rhs; -#else - return lhs < rhs or std::isnan(lhs) ? lhs : rhs; -#endif + return lhs < rhs || isnan(lhs) ? lhs : rhs; + } + else { + return (rhs < lhs ? rhs : lhs); + } } }; @@ -445,6 +437,21 @@ struct minimum { } }; +template <> +struct minimum { + CUTLASS_HOST_DEVICE + float operator()(float lhs, float rhs) const { +#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800) + float res; + asm volatile("min.NaN.f32 %0, %1, %2;\n" : "=f"(res) : "f"(lhs), "f"(rhs)); + return res; +#else + // No need for ADL; call std::isnan(float) on host and ::isnan(float) on device. + return lhs < rhs || (CUTLASS_CMATH_NAMESPACE :: isnan(lhs)) ? lhs : rhs; +#endif + } +}; + template struct minimum_with_nan_propagation : minimum {}; @@ -514,6 +521,8 @@ template struct guarded_multiply_add { CUTLASS_HOST_DEVICE C operator()(A const &a, B const &b, C const &c) const { + using CUTLASS_CMATH_NAMESPACE :: isnan; + if (isnan(a) || isnan(b)) { return C(0); } @@ -533,7 +542,10 @@ struct guarded_multiply_add { : "h"(*reinterpret_cast(&a)), "h"(*reinterpret_cast(&b)), "h"(*reinterpret_cast(&c))); return result; #else - if (isnan(a) || isnan(b)) { + // Namespace-qualifying isnan as cutlass::isnan saves the compiler + // the trouble of argument-dependent lookup. Calling std::isnan or + // ::isnan here would result in unwanted implicit conversion to float. 
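+    // A minimal illustration of the trade-off (assumes a cutlass::half_t value
+    // h; this sketch is not part of the surrounding function):
+    //   isnan(h);          // unqualified: ADL finds cutlass::isnan(half_t)
+    //   cutlass::isnan(h); // qualified: resolves directly, no lookup needed
+    //   std::isnan(h);     // converts h to float before testing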
+    if (cutlass::isnan(a) || cutlass::isnan(b)) {
       return half_t(0);
     }
     return a * b + c;
@@ -546,13 +558,9 @@ template
 struct guarded_multiply_add_relu0 {
   CUTLASS_HOST_DEVICE
   C operator()(A const &a, B const &b, C const &c) const {
-    if (
-#if defined(__CUDA_ARCH__)
-    ::isnan(a) || ::isnan(b)
-#else
-    std::isnan(a) || std::isnan(b)
-#endif
-    ) {
+    using CUTLASS_CMATH_NAMESPACE :: isnan;
+
+    if (isnan(a) || isnan(b)) {
       return C(0);
     }
     maximum mx;
@@ -571,13 +579,7 @@ struct guarded_multiply_add_relu0 {
       : "h"(*reinterpret_cast(&a)), "h"(*reinterpret_cast(&b)), "h"(*reinterpret_cast(&c)));
     return result;
 #else
-    if (
-#if defined(__CUDA_ARCH__)
-    ::isnan(a) || ::isnan(b)
-#else
-    std::isnan(a) || std::isnan(b)
-#endif
-    ) {
+    if (cutlass::isnan(a) || cutlass::isnan(b)) {
       return half_t(0);
     }
     maximum mx;
@@ -784,6 +786,10 @@ struct atomic_add {
 #if defined(__CUDA_ARCH__) || defined(__SYCL_DEVICE_ONLY__)
     atomicAdd(ptr, data);
+#else
+    CUTLASS_UNUSED(ptr);
+    CUTLASS_UNUSED(data);
+    CUTLASS_NOT_IMPLEMENTED();
 #endif
   }
 };
@@ -795,8 +801,9 @@ struct atomic_add
   void operator()(double *ptr, const double &data) {
 #if !defined(__CUDA_ARCH__)
-    CUTLASS_UNUSED(ptr);
-    CUTLASS_UNUSED(data);
+    CUTLASS_UNUSED(ptr);
+    CUTLASS_UNUSED(data);
+    CUTLASS_NOT_IMPLEMENTED();
 #elif (__CUDA_ARCH__ >= 600)
     atomicAdd(ptr, data);
 #else
@@ -823,6 +830,7 @@ struct atomic_add
 #if !defined(__CUDA_ARCH__) || (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 600))
     CUTLASS_UNUSED(ptr);
     CUTLASS_UNUSED(data);
+    CUTLASS_NOT_IMPLEMENTED();
 #else
     // Vector-2 atomic reduction requires .target sm_60 or higher
     uint32_t word = reinterpret_cast(data);
@@ -880,7 +888,6 @@ struct is_atomic> : platform::true_type {};
 template
 struct is_atomic> : platform::true_type {};
 
-
 /////////////////////////////////////////////////////////////////////////////////////////////////
 //
 // Partial specializations for nvcuda::wmma::fragment
diff --git a/include/cutlass/gemm/collective/builders/sm90_common.inl b/include/cutlass/gemm/collective/builders/sm90_common.inl
index 298793e886..8d95967f97 100644
--- a/include/cutlass/gemm/collective/builders/sm90_common.inl
+++ b/include/cutlass/gemm/collective/builders/sm90_common.inl
@@ -38,6 +38,7 @@
 #include "cutlass/detail/dependent_false.hpp"
 #include "cute/atom/mma_traits_sm90_gmma.hpp"
+#include "cute/atom/mma_traits_sm90_gmma_sparse.hpp"
 #include "cute/atom/copy_traits_sm90_tma.hpp"
 
 /////////////////////////////////////////////////////////////////////////////////////////////////
@@ -123,13 +124,12 @@ sm90_cluster_shape_to_tma_atom(UnimodalClusterShape) {
   }
 }
 
-// Generates the most efficient possible TiledCopy with cp.async copy atom given a set of parameters.
-template
+// Generates the most efficient possible TiledCopy with SIMT copy atom (e.g., cp.async) given a set of parameters.
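+// Illustrative call shape only; the cp.async atom named here is an assumption
+// carried over from the old helper, not something this interface requires.
+// Callers that previously wrote
+//   make_cp_async_gmem_tiled_copy<ThreadCount, Element, Alignment, StrideType, TileMN, TileK>()
+// now build the copy atom themselves and pass it in, e.g.
+//   using AlignmentType = cute::uint_byte_t<static_cast<int>(sizeof(Element)) * Alignment>;
+//   using CopyAtom      = cute::Copy_Atom<cute::SM80_CP_ASYNC_CACHEALWAYS<AlignmentType>, Element>;
+//   make_simt_gmem_tiled_copy<CopyAtom, ThreadCount, Alignment, StrideType, TileMN, TileK>();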
+template constexpr auto -make_cp_async_gmem_tiled_copy() { +make_simt_gmem_tiled_copy() { using namespace cute; - using AlignmentType = cute::uint_byte_t(sizeof(Element)) * Alignment>; constexpr int TileSizeMN = cute::size(TileMN{}); constexpr int TileSizeK = cute::size(TileK{}); @@ -144,7 +144,7 @@ make_cp_async_gmem_tiled_copy() { static_assert(ThreadCount % threads_major == 0); static_assert(threads_minor == 0 || (TileSizeMN % threads_minor == 0)); return make_tiled_copy( - Copy_Atom, Element>{}, + CopyAtom{}, Layout,Int>, Stride, _1>>{}, Layout>>{}); @@ -157,13 +157,12 @@ make_cp_async_gmem_tiled_copy() { static_assert(ThreadCount % threads_major == 0); static_assert(threads_minor == 0 || (TileSizeK % threads_minor == 0)); return make_tiled_copy( - Copy_Atom, Element>{}, + CopyAtom{}, Layout,Int>, Stride< _1,Int>>{}, Layout,_1>>{}); - } - else { - static_assert(cute::is_void_v, "Unsupported gmem layout for automatic gmem tiled copy builder."); + } else { + static_assert(cute::is_void_v, "Unsupported gmem layout for automatic gmem tiled copy builder."); } } @@ -319,6 +318,62 @@ ss_smem_selector() } } +// Helper for SS GMMA smem selection that considers a tensor TileShape: +// (BLK_MN, BLK_K) +// or hierarchically +// ((BLK_MN0,BLK_MN1,...),(BLK_K0,BLK_K1,...)) +// and returns the largest GMMA::Layout that fits BLK_MN0 and BLK_K0 +template +CUTE_HOST_DEVICE constexpr +auto +ss_smem_selector_sparse() +{ + using namespace cute; + + auto BLK_MN0 = size<0>(BLK_MN{}); + auto BLK_K0 = size<0>(BLK_K{}); + + static_assert(BLK_MN0 % 8 == 0, "BLK_MN0 must be a multiple of 8."); + static_assert(BLK_K0 % 8 == 0, "BLK_K0 must be a multiple of 8."); + + if constexpr (major == GMMA::Major::MN) { + if constexpr (BLK_MN0 % size<0>(GMMA::Layout_MN_SW128_SpAtom{}) == 0) { + return GMMA::Layout_MN_SW128_SpAtom{}; + } + else if constexpr (BLK_MN0 % size<0>(GMMA::Layout_MN_SW64_SpAtom{}) == 0) { + return GMMA::Layout_MN_SW64_SpAtom{}; + } + else if constexpr (BLK_MN0 % size<0>(GMMA::Layout_MN_SW32_SpAtom{}) == 0) { + return GMMA::Layout_MN_SW32_SpAtom{}; + } + else if constexpr (BLK_MN0 % size<0>(GMMA::Layout_MN_INTER_SpAtom{}) == 0) { + return GMMA::Layout_MN_INTER_SpAtom{}; + } + else { + static_assert(BLK_MN0 % size<0>(GMMA::Layout_MN_INTER_SpAtom{}) == 0, + "BLK_MN0 must be a multiple of size<0>(GMMA::Layout_MN_INTER_Atom{})"); + } + } + else if constexpr (major == GMMA::Major::K) { + if constexpr (BLK_K0 % size<1>(GMMA::Layout_K_SW128_SpAtom{}) == 0) { + return GMMA::Layout_K_SW128_SpAtom{}; + } + else if constexpr (BLK_K0 % size<1>(GMMA::Layout_K_SW64_SpAtom{}) == 0) { + return GMMA::Layout_K_SW64_SpAtom{}; + } + else if constexpr (BLK_K0 % size<1>(GMMA::Layout_K_SW32_SpAtom{}) == 0) { + return GMMA::Layout_K_SW32_SpAtom{}; + } + else if constexpr (BLK_K0 % size<1>(GMMA::Layout_K_INTER_SpAtom{}) == 0) { + return GMMA::Layout_K_INTER_SpAtom{}; + } + else { + static_assert(BLK_K0 % size<1>(GMMA::Layout_K_INTER_SpAtom{}) == 0, + "BLK_K0 must be a multiple of size<1>(GMMA::Layout_K_INTER_Atom{})"); + } + } +} + template constexpr bool is_input_size_two_bytes() { diff --git a/include/cutlass/gemm/collective/builders/sm90_gmma_builder.inl b/include/cutlass/gemm/collective/builders/sm90_gmma_builder.inl index 25b1f84832..8657aad2b7 100644 --- a/include/cutlass/gemm/collective/builders/sm90_gmma_builder.inl +++ b/include/cutlass/gemm/collective/builders/sm90_gmma_builder.inl @@ -31,6 +31,10 @@ #pragma once #include "cutlass/gemm/collective/builders/sm90_common.inl" +#include "cutlass/gemm/dispatch_policy.hpp" 
+#include "cutlass/pipeline/sm90_pipeline.hpp" +#include "cutlass/gemm/collective/collective_mma_decl.hpp" +#include "cutlass/gemm/collective/collective_builder_decl.hpp" // SM90 Collective Builders should be used only starting CUDA 12.0 #if (__CUDACC_VER_MAJOR__ >= 12) @@ -45,21 +49,21 @@ namespace cutlass::gemm::collective { namespace detail { -// Returns the maximum number of smem tiles that can be used with a given smem capacity, or overrides with manual count. +// Returns the maximum number of smem tiles that can be used with a given smem capacity, or overrides with manual count. template constexpr int compute_stage_count_or_override(StageCount stage_count) { return stages; } -// Returns the maximum number of smem tiles that can be used with a given smem capacity, or overrides with manual count. +// Returns the maximum number of smem tiles that can be used with a given smem capacity, or overrides with manual count. template constexpr int compute_stage_count_or_override(cute::Int stage_count) { return stages; } -// Returns the maximum number of smem tiles that can be used with a given smem capacity, or overrides with manual count. +// Returns the maximum number of smem tiles that can be used with a given smem capacity, or overrides with manual count. template constexpr int compute_stage_count_or_override(StageCountAutoCarveout stage_count) { @@ -74,7 +78,7 @@ compute_stage_count_or_override(StageCountAutoCarveout stage_cou return (CapacityBytes - carveout_bytes) / stage_bytes; } -// Returns the maximum number of smem tiles that can be used with a given smem capacity (with an optional scale matrix), or overrides with manual count. +// Returns the maximum number of smem tiles that can be used with a given smem capacity (with an optional scale matrix), or overrides with manual count. template constexpr int compute_stage_count_or_override_single_affine_transformed_input(StageCount stage_count) { @@ -82,16 +86,16 @@ compute_stage_count_or_override_single_affine_transformed_input(StageCount -constexpr int get_bits_for_possibly_void_element() { +constexpr int get_bits_for_possibly_void_element() { if constexpr (cute::is_same_v) { return 0; - } + } else { return sizeof_bits::value; } } -// Returns the maximum number of smem tiles that can be used with a given smem capacity (with an optional scale matrix), or overrides with manual count. +// Returns the maximum number of smem tiles that can be used with a given smem capacity (with an optional scale matrix), or overrides with manual count. template constexpr int compute_stage_count_or_override_single_affine_transformed_input(StageCountAutoCarveout stage_count) { @@ -109,7 +113,7 @@ compute_stage_count_or_override_single_affine_transformed_input(StageCountAutoCa static_assert(scale_bytes % 128 == 0, "Scale bytes must be a multiple of 128"); static_assert(zero_bytes % 128 == 0, "Zero bytes must be a multiple of 128"); - // When scales are void, s_bits will be 0 so no smem will be allocated for scales. + // When scales are void, s_bits will be 0 so no smem will be allocated for scales. 
constexpr int stage_bytes = cutlass::bits_to_bytes(a_bits * size<0>(TileShapeMNK{}) * size<2>(TileShapeMNK{})) + cutlass::bits_to_bytes(b_bits * size<1>(TileShapeMNK{}) * size<2>(TileShapeMNK{})) + @@ -136,7 +140,7 @@ is_warpspecialized_transpose_B(){ cutlass::gemm::detail::is_mn_major_B(); constexpr bool IsWarpSpecialized = cute::is_base_of_v || cute::is_base_of_v || - cute::is_base_of_v || + cute::is_base_of_v || cute::is_base_of_v || cute::is_base_of_v || cute::is_base_of_v; @@ -177,10 +181,12 @@ struct CollectiveBuilder< StageCountType, KernelScheduleType, cute::enable_if_t< - (cute::is_same_v || - cute::is_same_v || - cute::is_same_v || - cute::is_same_v) && + (cute::is_any_of_v) && not detail::is_use_rmem_A()> > { static_assert(is_static::value); @@ -191,10 +197,12 @@ struct CollectiveBuilder< static_assert(detail::is_aligned(), "Should meet TMA alignment requirement\n"); - static constexpr bool IsArrayOfPointersGemm = (cute::is_same_v); + static constexpr bool IsArrayOfPointersGemm = (cute::is_any_of_v); static constexpr bool IsFP8Input = detail::is_input_fp8(); static_assert(!IsFP8Input || (IsFP8Input && !IsArrayOfPointersGemm), - "Kernel[Array/Group]TmaWarpSpecializedCooperative is only compatible with FP8 FastAccum version right now\n"); + "KernelPtrArrayTmaWarpSpecialized[Cooperative|Pingpong] is only compatible with FP8 FastAccum version right now."); // For fp32 types, map to tf32 MMA value type using ElementAMma = cute::conditional_t, tfloat32_t, ElementA>; @@ -203,8 +211,10 @@ struct CollectiveBuilder< static constexpr cute::GMMA::Major GmmaMajorA = detail::gmma_ss_tag_to_major_A(); static constexpr cute::GMMA::Major GmmaMajorB = detail::gmma_ss_tag_to_major_B(); - using AtomLayoutMNK = cute::conditional_t< - cute::is_same_v || IsArrayOfPointersGemm, + static constexpr bool IsCooperative = cute::is_any_of_v; + using AtomLayoutMNK = cute::conditional_t>, Layout>>; using TiledMma = decltype(cute::make_tiled_mma(cute::GMMA::ss_op_selector< @@ -218,7 +228,10 @@ struct CollectiveBuilder< using SmemLayoutAtomB = decltype(detail::ss_smem_selector< GmmaMajorB, ElementBMma, decltype(cute::get<1>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>()); - static constexpr int PipelineStages = detail::compute_stage_count_or_override(TensorMapStorage); + + static constexpr int PipelineStages = detail::compute_stage_count_or_override(StageCountType{}); using DispatchPolicy = cute::conditional_t, @@ -227,8 +240,8 @@ struct CollectiveBuilder< MainloopSm90TmaGmmaWarpSpecializedFP8, MainloopSm90TmaGmmaWarpSpecialized>>; - using SmemCopyAtomA = void; - using SmemCopyAtomB = void; + using SmemCopyAtomA = void; + using SmemCopyAtomB = void; using CollectiveOp = CollectiveMma< DispatchPolicy, @@ -283,7 +296,7 @@ struct CollectiveBuilder< (cute::is_same_v || cute::is_same_v || cute::is_same_v) && - detail::is_use_rmem_A()> + detail::is_use_rmem_A()> > { static_assert(is_static::value); static_assert(is_static::value); @@ -322,8 +335,8 @@ struct CollectiveBuilder< using DispatchPolicy = MainloopSm90TmaGmmaRmemAWarpSpecialized< PipelineStages, ClusterShape_MNK, KernelScheduleType>; - using SmemCopyAtomA = cute::conditional_t>; - using SmemCopyAtomB = cute::conditional_t, void>; + using SmemCopyAtomA = cute::conditional_t>; + using SmemCopyAtomB = cute::conditional_t, void>; using CollectiveOp = CollectiveMma< DispatchPolicy, @@ -391,13 +404,23 @@ public: using ElementA = detail::deduce_mixed_width_dtype_t<0, ElementPairA_>; using ElementB = detail::deduce_mixed_width_dtype_t<0, 
ElementPairB_>; static_assert(cute::is_tuple::value ^ cute::is_tuple::value || - (NeitherIsTuple && (sizeof_bits::value != sizeof_bits::value)), + (NeitherIsTuple && (sizeof_bits::value != sizeof_bits::value)), "Either A OR B must be a tuple or the widths of A and B must be different."); static constexpr bool IsANarrow = sizeof_bits::value < sizeof_bits::value; - using GmemLayoutATag = GmemLayoutATag_; - using GmemLayoutBTag = GmemLayoutBTag_; + template + static auto get_stride(T const& t) { + if constexpr (not cute::is_layout::value) { + return t; + } + else { + return cute::stride(t); + } + } + + using GmemLayoutATag = decltype(get_stride(GmemLayoutATag_{})); + using GmemLayoutBTag = decltype(get_stride(GmemLayoutBTag_{})); using ElementPairA = cute::conditional_t, ElementPairA_>; using ElementPairB = cute::conditional_t, ElementPairB_>; @@ -445,14 +468,14 @@ public: static constexpr int PipelineStages = detail::compute_stage_count_or_override_single_affine_transformed_input(StageCountType{}); - using SmemCopyAtomA = cute::conditional_t>; - using SmemCopyAtomB = cute::conditional_t, void>; + using SmemCopyAtomA = cute::conditional_t>; + using SmemCopyAtomB = cute::conditional_t, void>; using DispatchPolicy = MainloopSm90TmaGmmaRmemAWarpSpecializedMixedInput; // We pack the scale data with the operand that will be optionally scaled and converted before MMA. - using StrideA = TagToStrideA_t; - using StrideB = TagToStrideB_t; + using StrideA = cute::conditional_t::value, GmemLayoutATag_, TagToStrideA_t>; + using StrideB = cute::conditional_t::value, GmemLayoutBTag_, TagToStrideB_t>; using CollectiveOp = CollectiveMma< DispatchPolicy, @@ -505,10 +528,12 @@ struct CollectiveBuilder< StageCountType, KernelScheduleType, cute::enable_if_t< - cute::is_same_v || - cute::is_same_v || - cute::is_same_v || - cute::is_same_v> + cute::is_any_of_v> > { static_assert(is_static::value); static_assert(is_static::value); @@ -526,10 +551,15 @@ struct CollectiveBuilder< static constexpr cute::GMMA::Major GmmaMajorA = detail::gmma_ss_tag_to_major_A(); static constexpr cute::GMMA::Major GmmaMajorB = detail::gmma_ss_tag_to_major_B(); - static constexpr bool IsArrayOfPointersGemm = (cute::is_same_v); - using AtomLayoutMNK = cute::conditional_t || - IsArrayOfPointersGemm, - Layout>, Layout>>; + static constexpr bool IsArrayOfPointersGemm = cute::is_any_of_v; + + static constexpr bool IsCooperative = cute::is_any_of_v; + + using AtomLayoutMNK = cute::conditional_t>, Layout>>; using TiledMma = decltype(cute::make_tiled_mma(cute::GMMA::ss_op_selector< ElementA, ElementB, ElementAccumulator, TileShape_MNK, GmmaMajorA, GmmaMajorB>(), AtomLayoutMNK{})); @@ -542,7 +572,11 @@ struct CollectiveBuilder< using SmemLayoutAtomB = decltype(detail::ss_smem_selector< GmmaMajorB, ElementB, decltype(cute::get<1>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>()); - static constexpr int PipelineStages = detail::compute_stage_count_or_override(TensorMapStorage); + static constexpr int Sm90ReducedSmemCapacityBytes = detail::sm90_smem_capacity_bytes - KernelSmemCarveout; + + static constexpr int PipelineStages = detail::compute_stage_count_or_override(StageCountType{}); using DispatchPolicy = cute::conditional_t, @@ -770,11 +804,16 @@ struct CollectiveBuilder< static constexpr int NumLoadWarpGroups = cute::is_same_v ? 
2 : 1; - using GmemTiledCopyA = decltype(detail::make_cp_async_gmem_tiled_copy< - NumThreadsPerWarpGroup * NumLoadWarpGroups, ElementA, AlignmentA, TagToStrideA_t, + using AlignmentTypeA = cute::uint_byte_t(sizeof(ElementA)) * AlignmentA>; + using GmemCopyAtomA = cute::Copy_Atom, ElementA>; + using GmemTiledCopyA = decltype(detail::make_simt_gmem_tiled_copy< + GmemCopyAtomA, NumThreadsPerWarpGroup * NumLoadWarpGroups, AlignmentA, TagToStrideA_t, decltype(cute::get<0>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>()); - using GmemTiledCopyB = decltype(detail::make_cp_async_gmem_tiled_copy< - NumThreadsPerWarpGroup * NumLoadWarpGroups, ElementB, AlignmentB, TagToStrideB_t, + + using AlignmentTypeB = cute::uint_byte_t(sizeof(ElementB)) * AlignmentB>; + using GmemCopyAtomB = cute::Copy_Atom, ElementB>; + using GmemTiledCopyB = decltype(detail::make_simt_gmem_tiled_copy< + GmemCopyAtomB, NumThreadsPerWarpGroup * NumLoadWarpGroups, AlignmentB, TagToStrideB_t, decltype(cute::get<1>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>()); using SmemLayoutAtomA = decltype(detail::ss_smem_selector< @@ -871,14 +910,19 @@ struct CollectiveBuilder< static constexpr int NumLoadWarpGroups = 1; - using GmemTiledCopyA = decltype(detail::make_cp_async_gmem_tiled_copy< - NumThreadsPerWarpGroup * NumLoadWarpGroups, ElementA, AlignmentA, TagToStrideA_t, + using AlignmentTypeA = cute::uint_byte_t(sizeof(ElementA)) * AlignmentA>; + using GmemCopyAtomA = cute::Copy_Atom, ElementA>; + using GmemTiledCopyA = decltype(detail::make_simt_gmem_tiled_copy< + GmemCopyAtomA, NumThreadsPerWarpGroup * NumLoadWarpGroups, AlignmentA, TagToStrideA_t, decltype(cute::get<0>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>()); - using GmemTiledCopyB = decltype(detail::make_cp_async_gmem_tiled_copy< - NumThreadsPerWarpGroup * NumLoadWarpGroups, ElementB, AlignmentB, TagToStrideB_t, + + using AlignmentTypeB = cute::uint_byte_t(sizeof(ElementB)) * AlignmentB>; + using GmemCopyAtomB = cute::Copy_Atom, ElementB>; + using GmemTiledCopyB = decltype(detail::make_simt_gmem_tiled_copy< + GmemCopyAtomB, NumThreadsPerWarpGroup * NumLoadWarpGroups, AlignmentB, TagToStrideB_t, decltype(cute::get<1>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>()); - using SmemLayoutAtomA = decltype(detail::rs_smem_selector(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{})), IsWarpSpecializedTransposeB>()); using SmemLayoutAtomB = decltype(detail::rs_smem_selector(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{})), IsWarpSpecializedTransposeB>()); @@ -889,8 +933,8 @@ struct CollectiveBuilder< using DispatchPolicy = MainloopSm90CpAsyncGmmaRmemAWarpSpecialized< PipelineStages, ClusterShape_MNK, KernelScheduleType>; - using SmemCopyAtomA = cute::conditional_t>; - using SmemCopyAtomB = cute::conditional_t, void>; + using SmemCopyAtomA = cute::conditional_t>; + using SmemCopyAtomB = cute::conditional_t, void>; using CollectiveOp = CollectiveMma< DispatchPolicy, @@ -1001,3 +1045,4 @@ static constexpr bool IsMixedWidthInput = IsDifferentWidth || (IsDifferentWidth } // namespace cutlass::gemm::collective ///////////////////////////////////////////////////////////////////////////////////////////////// + diff --git a/include/cutlass/gemm/collective/builders/sm90_sparse_config.inl b/include/cutlass/gemm/collective/builders/sm90_sparse_config.inl new file mode 100644 index 0000000000..f9aa7bab2d --- /dev/null +++ b/include/cutlass/gemm/collective/builders/sm90_sparse_config.inl @@ -0,0 +1,268 @@ 
+/*************************************************************************************************** + * Copyright (c) 2024 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ + +/*! 
\file
+  \brief Sparse configs specific for SM90 structured sparse kernels
+*/
+
+
+#pragma once
+
+#include "cute/atom/mma_traits_sm90_gmma.hpp"  // cute::GMMA::Major
+#include "cute/layout.hpp"                     // cute::Layout, cute::Shape, cute::Stride
+#include "cute/numeric/integral_constant.hpp"  // cute::Int
+#include "cute/numeric/numeric_types.hpp"      // cute::sizeof_bits_v
+#include "cute/pointer_sparse.hpp"             // cute::is_sparse
+#include "cute/util/type_traits.hpp"           // cute::is_same_v, cute::conditional_t
+#include "cutlass/fast_math.h"                 // cutlass::round_up
+#include "cutlass/layout/matrix.h"             // cutlass::RowMajor, cutlass::ColumnMajor
+
+namespace cutlass {
+
+using namespace cute;
+
+template<
+  class ElementAMma_,
+  GMMA::Major GmmaMajorA,
+  class ElementEMma_,
+  class MinTileShapeK = Int<32>
+>
+struct Sm90GemmSparseConfig {
+
+  static_assert(cute::is_sparse<ElementAMma_>::value, "ElementAMma MUST be sparse elem");
+  static_assert(cute::is_sparse<ElementEMma_>::value, "ElementEMma MUST be sparse elem");
+
+  // A
+  using ElementAMma = ElementAMma_;
+  using ElementAMmaRaw = typename ElementAMma::raw_type;
+  using ElementAMmaSparsity = Int<ElementAMma::sparsity>;
+
+  // Metadata (E)
+  using ElementEMma = ElementEMma_;
+  using ElementEMmaRaw = typename ElementEMma::raw_type;
+  using ElementEMmaSparsity = Int<ElementEMma::sparsity>;
+
+  // MMA type
+  static constexpr bool IsQmma  = cute::is_same_v<ElementAMmaRaw, float_e4m3_t> && ElementAMmaSparsity{} == _2{} ||
+                                  cute::is_same_v<ElementAMmaRaw, float_e5m2_t> && ElementAMmaSparsity{} == _2{};
+  static constexpr bool IsImma  = cute::is_same_v<ElementAMmaRaw, int8_t>       && ElementAMmaSparsity{} == _2{} ||
+                                  cute::is_same_v<ElementAMmaRaw, uint8_t>      && ElementAMmaSparsity{} == _2{};
+  static constexpr bool IsHmma  = cute::is_same_v<ElementAMmaRaw, half_t>       && ElementAMmaSparsity{} == _2{} ||
+                                  cute::is_same_v<ElementAMmaRaw, bfloat16_t>   && ElementAMmaSparsity{} == _2{};
+  static constexpr bool IsTfmma = cute::is_same_v<ElementAMmaRaw, tfloat32_t>   && ElementAMmaSparsity{} == _2{} ||
+                                  cute::is_same_v<ElementAMmaRaw, float>        && ElementAMmaSparsity{} == _2{};
+  static_assert(int(IsQmma) + int(IsImma) + int(IsHmma) + int(IsTfmma) == 1, "Ambiguous Input Type Config (failed to choose MMA type)");
+
+  // Number of ElementARaw stored in ElementAMmaRaw. For Hopper this is always 1.
+  using ElemsARawPerElementAMmaRaw = _1;
+
+  // ElementA Sparsity Ratio
+  using ElementASparsity = ElementAMmaSparsity;
+  static_assert(ElementASparsity{} == _2{}, "ElementASparsity must be 2 for Hopper Sparse Gemm");
+
+  // Logical/Physical ElementA per Chunk
+  using LogicalElemsAPerChunk = conditional_t;
+  using PhysicalElemsAPerChunk = Int;
+
+  // Metadata Bits
+  using ElementEBitsPerChunk = _4;
+  using ElementEBitsPerElementAMma = cute::conditional_t;
+
+  // Metadata layout. Units are the corresponding logical elements.
+  // Basic metadata block is (16,64) for 8-bit, (16,32) for 16-bit, (16,16) for 32-bit data types.
+  // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#sparse-wgmma-metadata-64n32-f16bf16
+  // Tensor E layout atom stacks 4 basic blocks along M mode to align with WGMMA instruction shape and
+  // stacks 1-4 blocks along K mode and reorders memory layout to allow for vectorized loads from smem.
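+  // Worked example (illustrative): for a 16-bit input type, BlockK is
+  // 512 / 16 = 32 logical elements, so a MinTileShapeK of 128 (the cap the
+  // collective builders pass in) stacks NumK = 128 / 32 = 4 basic (16,32)
+  // blocks along K.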
+  using BlockK = Int<512 / sizeof_bits_v<ElementAMmaRaw>>;
+  static_assert(MinTileShapeK{} % BlockK{} == 0, "MinTileShapeK must be a multiple of BlockK");
+  using NumK = decltype(MinTileShapeK{} / BlockK{});
+
+  using TensorEAtom_32bit = decltype(make_ordered_layout(Shape, Shape<_8,_2,NumK>>{},
+                                                         Step , Step <_0,_4, _2>>{}));
+
+  using TensorEAtom_16bit = decltype(make_ordered_layout(Shape, Shape<_16,_2,NumK>>{},
+                                                         Step , Step < _0,_4, _2>>{}));
+
+  using TensorEAtom_8bit = decltype(make_ordered_layout(Shape<_64,MinTileShapeK>{},
+                                                        Step < _1, _0>{}));
+
+  using TensorEAtom = cute::conditional_t<(IsQmma || IsImma), TensorEAtom_8bit,
+                      cute::conditional_t>;
+
+  // Logical elems that construct the atomK for tensorE/A.
+  using TensorEAtomK = Int<size<1>(TensorEAtom{})>;
+  using TensorEAtomM = Int<size<0>(TensorEAtom{})>;
+
+  // Tensor E alignment requirements
+  using TensorEAlignmentM = TensorEAtomM;
+  using TensorEAlignmentK = TensorEAtomK;
+
+  // Tensor A alignment requirements
+  // When A is MN major, TensorAAlignmentK needs to be a multiple of the chunk size.
+  // When A is K major, TensorAAlignmentK needs to be a multiple of the TMA requirement times the tensorA sparsity,
+  // because TensorACompressed needs to satisfy TMA requirements.
+  using TensorAAlignmentK = cute::conditional_t>>;
+
+  // When A is MN major, TensorAAlignmentM needs to be a multiple of the TMA requirement.
+  // When A is K major, there is no requirement on TensorAAlignmentM.
+  using TensorAAlignmentM = cute::conditional_t * ElemsARawPerElementAMmaRaw{}>,
+                            _1>;
+
+  // The following two functions are provided for users to determine the static layout types.
+  CUTE_HOST_DEVICE
+  static constexpr auto
+  deduce_layoutA() {
+    using LayoutMMajor = Layout,
+                                int32_t>,
+                         Stride,
+                                int64_t>>;
+
+    using LayoutKMajor = Layout,
+                                int32_t>,
+                         Stride,
+                                int64_t>>;
+
+    if constexpr (GmmaMajorA == GMMA::Major::MN) {
+      return LayoutMMajor{};
+    }
+    else {
+      return LayoutKMajor{};
+    }
+  }
+
+  CUTE_HOST_DEVICE
+  static constexpr auto
+  deduce_layoutE() {
+    return make_layout(
+      make_shape(make_shape(shape<0>(TensorEAtom{}), int32_t(0)),
+                 make_shape(shape<1>(TensorEAtom{}), int32_t(0)),
+                 int32_t(0)),
+      make_stride(make_stride(stride<0>(TensorEAtom{}), cute::Int{}),
+                  make_stride(stride<1>(TensorEAtom{}), int64_t(0)),
+                  int64_t(0))
+    );
+  }
+
+  // This function maps a CuTe layout back to a CUTLASS layout tag (RowMajor/ColumnMajor)
+  template <class ShapeA, class StrideA>
+  CUTE_HOST_DEVICE
+  static constexpr auto
+  deduce_layoutA_tag(Layout<ShapeA, StrideA> layout_a) {
+    /*
+      (m, (2, k/2), l) : (2, (1, m*2), m*k)    M-major
+      (m, (2, k/2), l) : (k, (1, 2),   m*k)    K-major
+    */
+    // Check if the given layout_a is possibly a sparse tensorA layout.
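+    // Worked instance (illustrative): with ElementASparsity == 2, m == 128,
+    // k == 64, and l == 2, an M-major tensorA layout has the form
+    //   (128, (2, 32), 2) : (2, (1, 256), 8192)
+    // and the checks below classify it as ColumnMajor.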
+ static_assert(rank_v == 3 && depth_v == 2, "Rank and depth mismatch with the sparse tensorA's layout."); + static_assert(rank(get<1>(ShapeA{})) == 2 && rank(flatten(ShapeA{})) == 4, + "Not likely to be a sparse tensorA's layout."); + static_assert(get<1,0>(StrideA{}) == 1 && get<1,0>(ShapeA{}) == ElementASparsity{}, + "Not likely to be a sparse tensorA's layout."); + static_assert(get<0>(StrideA{}) == ElementASparsity{} || get<1,1>(StrideA{}) == ElementASparsity{}, + "Not likely to be a sparse tensorA's layout."); + + if constexpr (get<0>(StrideA{}) == ElementASparsity{}) { + return cutlass::layout::ColumnMajor{}; + } + else { + return cutlass::layout::RowMajor{}; + } + } + + // Fill tensor A layout from dynamic problem shape + template + CUTE_HOST_DEVICE + static constexpr auto + fill_layoutA(ProblemShape problem_shape) { + + const auto [M, N, K, L] = problem_shape; + + // Round up to satisfy TensorA Alignment requirement + const auto M_AlignedAC = cutlass::round_up(M, TensorAAlignmentM{}); + const auto K_AlignedAC = cutlass::round_up(K, TensorAAlignmentK{}); + + if constexpr (GmmaMajorA == GMMA::Major::MN) { + return make_layout( + make_shape(int32_t(M_AlignedAC), + make_shape(ElementASparsity{}, int32_t(K_AlignedAC) / ElementASparsity{}), + int32_t(L)), + make_stride(ElementASparsity{}, + make_stride(_1{}, int64_t(M_AlignedAC) * ElementASparsity{}), + (L == 1) ? int64_t(0) : int64_t(M_AlignedAC * K_AlignedAC)) + ); + } + else { + return make_layout( + make_shape(int32_t(M_AlignedAC), + make_shape(ElementASparsity{}, int32_t(K_AlignedAC / ElementASparsity{})), + int32_t(L)), + make_stride(int64_t(K_AlignedAC), + make_stride(_1{}, ElementASparsity{}), + (L == 1) ? int64_t(0) : int64_t(M_AlignedAC * K_AlignedAC)) + ); + } + } + + // Fill tensor E layout from dynamic problem shape + template + CUTE_HOST_DEVICE + static constexpr auto + fill_layoutE(ProblemShape problem_shape) { + const auto [M, N, K, L] = problem_shape; + + // Round up to satisfy TensorEAlignment requirement + const auto M_AlignedE = cutlass::round_up(M, TensorEAlignmentM{}); + const auto K_AlignedE = cutlass::round_up(K, TensorEAlignmentK{}); + + // TensorEAtom first along m-dim, then along k-dim, then along batch + static_assert(TensorEAlignmentM{} == TensorEAtomM{}, "Shape below assumes TensorEAlignmentM == TensorEAtomM"); + static_assert(TensorEAlignmentK{} == TensorEAtomK{}, "Shape below assumes TensorEAlignmentK == TensorEAtomK"); + + return make_layout( + make_shape(make_shape(shape<0>(TensorEAtom{}), int32_t(M_AlignedE / TensorEAtomM{})), + make_shape(shape<1>(TensorEAtom{}), int32_t(K_AlignedE / TensorEAtomK{})), + int32_t(L)), + make_stride(make_stride(stride<0>(TensorEAtom{}), cute::Int{}), + make_stride(stride<1>(TensorEAtom{}), int64_t(M_AlignedE * TensorEAtomK{})), + (L == 1) ? int64_t(0) : int64_t(M_AlignedE * K_AlignedE)) + ); + } +}; + +} // namespace cutlass diff --git a/include/cutlass/gemm/collective/builders/sm90_sparse_gmma_builder.inl b/include/cutlass/gemm/collective/builders/sm90_sparse_gmma_builder.inl new file mode 100644 index 0000000000..9b608fe022 --- /dev/null +++ b/include/cutlass/gemm/collective/builders/sm90_sparse_gmma_builder.inl @@ -0,0 +1,388 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. 
+ * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +#pragma once + +#include "cutlass/gemm/collective/builders/sm90_common.inl" +#include "cutlass/gemm/collective/builders/sm90_sparse_config.inl" + +// SM90 Collective Builders should be used only starting CUDA 12.0 +#if (__CUDACC_VER_MAJOR__ >= 12) +#define CUTLASS_SM90_COLLECTIVE_BUILDER_SUPPORTED +#endif + +///////////////////////////////////////////////////////////////////////////////////////////////// + +namespace cutlass::gemm::collective { + +///////////////////////////////////////////////////////////////////////////////////////////////// + +namespace detail { + +// Returns the maximum number of smem tiles that can be used with a given smem capacity, or overrides with manual count. +template +constexpr int +compute_stage_count_or_override_sparse(StageCount stage_count) { + return stages; +} + +// Returns the maximum number of smem tiles that can be used with a given smem capacity, or overrides with manual count. +template +constexpr int +compute_stage_count_or_override_sparse(cute::Int stage_count) { + return stages; +} + +// Returns the maximum number of smem tiles that can be used with a given smem capacity, or overrides with manual count. 
+template +constexpr int +compute_stage_count_or_override_sparse(StageCountAutoCarveout stage_count) { + constexpr auto mainloop_pipeline_bytes = sizeof(typename cutlass::PipelineTmaAsync<1>::SharedStorage); + constexpr auto a_bits = cute::sizeof_bits_v; + constexpr auto b_bits = cute::sizeof_bits_v; + constexpr auto e_bits = cute::sizeof_bits_v; + constexpr int stage_bytes = + cutlass::bits_to_bytes(a_bits * size<0>(TileShapeMNK{}) * size<2>(TileShapeMNK{})) + + cutlass::bits_to_bytes(b_bits * size<1>(TileShapeMNK{}) * size<2>(TileShapeMNK{})) + + cutlass::bits_to_bytes(e_bits * size<0>(TileShapeMNK{}) * size<2>(TileShapeMNK{})) + + static_cast(mainloop_pipeline_bytes); + + return (CapacityBytes - carveout_bytes) / stage_bytes; +} + +} // namespace detail + +///////////////////////////////////////////////////////////////////////////////////////////////// + +// GMMA_TMA_WS_SS_SPARSE +template < + class ElementA, + class GmemLayoutATag, + int AlignmentA, + class ElementB, + class GmemLayoutBTag, + int AlignmentB, + class ElementAccumulator, + class TileShape_MNK, + class ClusterShape_MNK, + class StageCountType, + class KernelScheduleType +> +struct CollectiveBuilder< + arch::Sm90, + arch::OpClassSparseTensorOp, + ElementA, + GmemLayoutATag, + AlignmentA, + ElementB, + GmemLayoutBTag, + AlignmentB, + ElementAccumulator, + TileShape_MNK, + ClusterShape_MNK, + StageCountType, + KernelScheduleType, + cute::enable_if_t< + (cute::is_same_v || + cute::is_same_v || + cute::is_same_v) && + not detail::is_use_rmem_A()> +> { + static_assert(is_static::value); + static_assert(is_static::value); +#ifndef CUTLASS_SM90_COLLECTIVE_BUILDER_SUPPORTED + static_assert(cutlass::detail::dependent_false, "Unsupported Toolkit for SM90 Collective Builder\n"); +#endif + static_assert(detail::is_aligned(), + "Should meet TMA alignment requirement\n"); + + static constexpr bool IsFP8Input = detail::is_input_fp8(); + static_assert(!IsFP8Input, "FP8 sparse collective currently only supports FastAccum schedules"); + + // For fp32 types, map to tf32 MMA value type + using ElementAMmaRaw = cute::conditional_t, tfloat32_t, ElementA>; + using ElementBMma = cute::conditional_t, tfloat32_t, ElementB>; + + static constexpr cute::GMMA::Major GmmaMajorA = detail::gmma_ss_tag_to_major_A(); + static constexpr cute::GMMA::Major GmmaMajorB = detail::gmma_ss_tag_to_major_B(); + + using AtomLayoutMNK = cute::conditional_t< + cute::is_same_v, + Layout>, Layout>>; + + using TiledMma = decltype(cute::make_tiled_mma(cute::GMMA::ss_op_selector_sparse< + ElementAMmaRaw, ElementBMma, ElementAccumulator, TileShape_MNK, GmmaMajorA, GmmaMajorB>(), AtomLayoutMNK{})); + + using ElementAMma = typename TiledMma::ValTypeA; + using ElementAMmaSparsity = Int; + using ElementEMma = typename TiledMma::ValTypeE; + using SparseConfig = cutlass::Sm90GemmSparseConfig(TileShape_MNK{}),_128{}))>; + + using LayoutA = decltype(SparseConfig::deduce_layoutA()); + using LayoutE = decltype(SparseConfig::deduce_layoutE()); + using LayoutPairAE = decltype(cute::make_tuple(LayoutA{}, LayoutE{})); + + using GmemTiledCopyA = decltype(detail::sm90_cluster_shape_to_tma_atom(shape<1>(ClusterShape_MNK{}))); + using GmemTiledCopyB = decltype(detail::sm90_cluster_shape_to_tma_atom(shape<0>(ClusterShape_MNK{}))); + + using SmemLayoutAtomA = decltype(detail::ss_smem_selector_sparse< + GmmaMajorA, ElementAMmaRaw, decltype(cute::get<0>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{})), ElementAMmaSparsity>()); + using SmemLayoutAtomB = decltype(detail::ss_smem_selector< 
+ GmmaMajorB, ElementBMma, decltype(cute::get<1>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>()); + + static constexpr int PipelineStages = detail::compute_stage_count_or_override_sparse(StageCountType{}); + using DispatchPolicy = MainloopSm90TmaGmmaWarpSpecializedSparse; + + using SmemCopyAtomA = void; + using SmemCopyAtomB = void; + + using CollectiveOp = CollectiveMma< + DispatchPolicy, + TileShape_MNK, + ElementA, + LayoutPairAE, + ElementB, + TagToStrideB_t, + TiledMma, + GmemTiledCopyA, + SmemLayoutAtomA, + SmemCopyAtomA, + cute::identity, + GmemTiledCopyB, + SmemLayoutAtomB, + SmemCopyAtomB, + cute::identity + >; +}; + +// GMMA_TMA_WS_SS_FP8_FAST_ACCUM_SPARSE +template < + class ElementA, + class GmemLayoutATag, + int AlignmentA, + class ElementB, + class GmemLayoutBTag, + int AlignmentB, + class ElementAccumulator, + class TileShape_MNK, + class ClusterShape_MNK, + class StageCountType, + class KernelScheduleType +> +struct CollectiveBuilder< + arch::Sm90, + arch::OpClassSparseTensorOp, + ElementA, + GmemLayoutATag, + AlignmentA, + ElementB, + GmemLayoutBTag, + AlignmentB, + ElementAccumulator, + TileShape_MNK, + ClusterShape_MNK, + StageCountType, + KernelScheduleType, + cute::enable_if_t< + (cute::is_same_v || + cute::is_same_v || + cute::is_same_v)> +> { + static_assert(is_static::value); + static_assert(is_static::value); + static_assert(detail::is_aligned(), + "Should meet TMA alignment requirement\n"); + static_assert(detail::is_input_fp8(), + "Only FP8 datatypes are compatible with these kernel schedules\n"); +#ifndef CUTLASS_SM90_COLLECTIVE_BUILDER_SUPPORTED + static_assert(cutlass::detail::dependent_false, "Unsupported Toolkit for SM90 Collective Builder\n"); +#endif + + static constexpr cute::GMMA::Major GmmaMajorA = detail::gmma_ss_tag_to_major_A(); + static constexpr cute::GMMA::Major GmmaMajorB = detail::gmma_ss_tag_to_major_B(); + + using AtomLayoutMNK = cute::conditional_t< + cute::is_same_v, + Layout>, Layout>>; + + using TiledMma = decltype(cute::make_tiled_mma(cute::GMMA::ss_op_selector_sparse< + ElementA, ElementB, ElementAccumulator, TileShape_MNK, GmmaMajorA, GmmaMajorB>(), AtomLayoutMNK{})); + + using ElementAMma = typename TiledMma::ValTypeA; + using ElementAMmaSparsity = Int; + using ElementEMma = typename TiledMma::ValTypeE; + using SparseConfig = cutlass::Sm90GemmSparseConfig(TileShape_MNK{}),_128{}))>; + + using LayoutA = decltype(SparseConfig::deduce_layoutA()); + using LayoutE = decltype(SparseConfig::deduce_layoutE()); + using LayoutPairAE = decltype(cute::make_tuple(LayoutA{}, LayoutE{})); + + using GmemTiledCopyA = decltype(detail::sm90_cluster_shape_to_tma_atom(shape<1>(ClusterShape_MNK{}))); + using GmemTiledCopyB = decltype(detail::sm90_cluster_shape_to_tma_atom(shape<0>(ClusterShape_MNK{}))); + + using SmemLayoutAtomA = decltype(detail::ss_smem_selector_sparse< + GmmaMajorA, ElementA, decltype(cute::get<0>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{})), ElementAMmaSparsity>()); + using SmemLayoutAtomB = decltype(detail::ss_smem_selector< + GmmaMajorB, ElementB, decltype(cute::get<1>(TileShape_MNK{})), decltype(cute::get<2>(TileShape_MNK{}))>()); + + static constexpr int PipelineStages = detail::compute_stage_count_or_override_sparse(StageCountType{}); + using DispatchPolicy = MainloopSm90TmaGmmaWarpSpecializedSparse; + + using SmemCopyAtomA = void; + using SmemCopyAtomB = void; + + using CollectiveOp = CollectiveMma< + DispatchPolicy, + TileShape_MNK, + ElementA, + LayoutPairAE, + ElementB, + TagToStrideB_t, + TiledMma, 
+ GmemTiledCopyA, + SmemLayoutAtomA, + SmemCopyAtomA, + cute::identity, + GmemTiledCopyB, + SmemLayoutAtomB, + SmemCopyAtomB, + cute::identity + >; +}; + +// GMMA_TMA_WS_RS_SPARSE +template < + class ElementA, + class GmemLayoutATag, + int AlignmentA, + class ElementB, + class GmemLayoutBTag, + int AlignmentB, + class ElementAccumulator, + class TileShape_MNK, + class ClusterShape_MNK, + class StageCountType, + class KernelScheduleType +> +struct CollectiveBuilder< + arch::Sm90, + arch::OpClassSparseTensorOp, + ElementA, + GmemLayoutATag, + AlignmentA, + ElementB, + GmemLayoutBTag, + AlignmentB, + ElementAccumulator, + TileShape_MNK, + ClusterShape_MNK, + StageCountType, + KernelScheduleType, + cute::enable_if_t< + (cute::is_same_v || + cute::is_same_v || + cute::is_same_v) && + detail::is_use_rmem_A()> +> { + static_assert(cutlass::detail::dependent_false, "Mainloop with sparse A sourced from RF is not implemented."); +}; + +// Sparse GMMA auto kernel schedule +template < + class ElementA, + class GmemLayoutATag, + int AlignmentA, + class ElementB, + class GmemLayoutBTag, + int AlignmentB, + class ElementAccumulator, + class TileShape_MNK, + class ClusterShape_MNK, + class StageCountType, + class KernelScheduleType +> +struct CollectiveBuilder< + arch::Sm90, + arch::OpClassSparseTensorOp, + ElementA, + GmemLayoutATag, + AlignmentA, + ElementB, + GmemLayoutBTag, + AlignmentB, + ElementAccumulator, + TileShape_MNK, + ClusterShape_MNK, + StageCountType, + KernelScheduleType, + cute::enable_if_t> +> { + static_assert(is_static::value); + static_assert(is_static::value); +#ifndef CUTLASS_SM90_COLLECTIVE_BUILDER_SUPPORTED + static_assert(cutlass::detail::dependent_false, "Unsupported Toolkit for SM90 Collective Builder\n"); +#endif + + static constexpr bool IsFP8Input = detail::is_input_fp8(); + + using KernelSchedule = cute::conditional_t(TileShape_MNK{}) == Int<64>{}, + cute::conditional_t, + cute::conditional_t>; + + using CollectiveOp = typename CollectiveBuilder< + arch::Sm90, + arch::OpClassSparseTensorOp, + ElementA, + GmemLayoutATag, + AlignmentA, + ElementB, + GmemLayoutBTag, + AlignmentB, + ElementAccumulator, + TileShape_MNK, + ClusterShape_MNK, + StageCountType, + KernelSchedule + >::CollectiveOp; +}; + +///////////////////////////////////////////////////////////////////////////////////////////////// + +} // namespace cutlass::gemm::collective + +///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/gemm/collective/collective_builder.hpp b/include/cutlass/gemm/collective/collective_builder.hpp index 3698cdfc60..fa31aebaa1 100644 --- a/include/cutlass/gemm/collective/collective_builder.hpp +++ b/include/cutlass/gemm/collective/collective_builder.hpp @@ -38,6 +38,7 @@ #include "cutlass/gemm/collective/collective_builder_decl.hpp" #include "cutlass/gemm/collective/builders/sm90_gmma_builder.inl" +#include "cutlass/gemm/collective/builders/sm90_sparse_gmma_builder.inl" #if defined(SYCL_INTEL_TARGET) #include "cutlass/gemm/collective/builders/xe_mma_builder.inl" diff --git a/include/cutlass/gemm/collective/collective_mma.hpp b/include/cutlass/gemm/collective/collective_mma.hpp index 7a65382449..9a1fc2c3d4 100644 --- a/include/cutlass/gemm/collective/collective_mma.hpp +++ b/include/cutlass/gemm/collective/collective_mma.hpp @@ -43,6 +43,7 @@ #include "cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized.hpp" #include "cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp" #include 
"cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp" +#include "cutlass/gemm/collective/sm90_sparse_mma_tma_gmma_ss_warpspecialized.hpp" #include "cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp" #include "cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp" diff --git a/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp b/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp index 7b16648392..80e128c3a0 100644 --- a/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp +++ b/include/cutlass/gemm/collective/sm90_mma_array_tma_gmma_ss_warpspecialized.hpp @@ -166,12 +166,12 @@ struct CollectiveMma< size<0>(ClusterShape{}))); // mcast along M mode for this N load, if any struct SharedStorage { - struct TensorStorage : cute::aligned_struct<128> { + struct TensorStorage : cute::aligned_struct<128, _0> { cute::array_aligned> smem_A; cute::array_aligned> smem_B; } tensors; - struct TensorMapStorage : cute::aligned_struct<128> { + struct TensorMapStorage : cute::aligned_struct<128, _0> { cute::TmaDescriptor smem_tensormap_A; cute::TmaDescriptor smem_tensormap_B; } tensormaps; @@ -623,56 +623,42 @@ struct CollectiveMma< // CUTLASS_DEVICE auto - tensormaps_init(Params const& mainloop_params, int32_t const sm_count, int32_t const sm_idx) const { + tensormaps_init( + Params const& mainloop_params, + TensorMapStorage& shared_tensormaps, + int32_t sm_count, + int32_t sm_idx) { cute::TmaDescriptor* gmem_tensormap = reinterpret_cast(mainloop_params.tensormaps); cute::TmaDescriptor* tma_desc_a = &gmem_tensormap[sm_idx]; cute::TmaDescriptor* tma_desc_b = &gmem_tensormap[sm_idx + sm_count]; if (cute::elect_one_sync()) { - // Bringing tensormaps from params to gmem for modification later + // Bringing tensormaps from params to smem for modification later Tensor pA_tensormap = make_tensor(mainloop_params.tma_load_a.get_tma_descriptor(), Int<1>{}, Int<1>{}); - Tensor gA_tensormap = make_tensor(tma_desc_a, Int<1>{}, Int<1>{}); + Tensor sA_tensormap = make_tensor(make_smem_ptr(&shared_tensormaps.smem_tensormap_A), Int<1>{}, Int<1>{}); Tensor pB_tensormap = make_tensor(mainloop_params.tma_load_b.get_tma_descriptor(), Int<1>{}, Int<1>{}); - Tensor gB_tensormap = make_tensor(tma_desc_b, Int<1>{}, Int<1>{}); + Tensor sB_tensormap = make_tensor(make_smem_ptr(&shared_tensormaps.smem_tensormap_B), Int<1>{}, Int<1>{}); - copy(recast(pA_tensormap), recast(gA_tensormap)); - copy(recast(pB_tensormap), recast(gB_tensormap)); + copy(recast(pA_tensormap), recast(sA_tensormap)); + copy(recast(pB_tensormap), recast(sB_tensormap)); } + syncwarp(); return cute::make_tuple(tma_desc_a, tma_desc_b); } - // Bringing tensormaps to smem (to be done by single thread) - template - CUTLASS_DEVICE - void - tensormaps_fetch_to_smem( - TensorMapStorage& shared_tensormap, - cute::tuple const& input_tensormaps) const { - Tensor gA_tensormap = make_tensor(make_gmem_ptr(get<0>(input_tensormaps)), Int<1>{}, Int<1>{}); - Tensor sA_tensormap = make_tensor(make_smem_ptr(&shared_tensormap.smem_tensormap_A), Int<1>{}, Int<1>{}); - Tensor gB_tensormap = make_tensor(make_gmem_ptr(get<1>(input_tensormaps)), Int<1>{}, Int<1>{}); - Tensor sB_tensormap = make_tensor(make_smem_ptr(&shared_tensormap.smem_tensormap_B), Int<1>{}, Int<1>{}); - - copy(recast(gA_tensormap), recast(sA_tensormap)); - copy(recast(gB_tensormap), recast(sB_tensormap)); - - cp_async_fence(); - cp_async_wait<0>(); - } - // Replace address for the global 
tensor (to be done by single thread) CUTLASS_DEVICE void tensormaps_replace_global_address( - TensorMapStorage& shared_tensormap, + TensorMapStorage& shared_tensormaps, Params const& mainloop_params, int32_t next_batch) { // Replacing global_address for the next batch - cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormap.smem_tensormap_A, + cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormaps.smem_tensormap_A, mainloop_params.ptr_A[next_batch]); - cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormap.smem_tensormap_B, + cute::tma_descriptor_replace_addr_in_shared_mem(shared_tensormaps.smem_tensormap_B, mainloop_params.ptr_B[next_batch]); } @@ -681,21 +667,19 @@ struct CollectiveMma< CUTLASS_DEVICE void tensormaps_replace_global_tensor_properties( - TensorMapStorage& shared_tensormap, + TensorMapStorage& shared_tensormaps, Params const& mainloop_params, int32_t next_group, ProblemShape_MNKL problem_shape_mnkl) { const uint32_t M = get<0>(problem_shape_mnkl); const uint32_t N = get<1>(problem_shape_mnkl); const uint32_t K = get<2>(problem_shape_mnkl); - // Only consider dimensions and strides that we need to recalculate and replace for each group - constexpr int TensorRank = rank(ProblemShape_MNKL{}) - 1; // excluding either M or N - static_assert(TensorRank == Int<3>{}, - "Descriptor modification for global dims & strides expects rank as 3."); - cute::array prob_shape_A = {1,1,1}; - cute::array prob_stride_A = {0,0,0}; - cute::array prob_shape_B = {1,1,1}; - cute::array prob_stride_B = {0,0,0}; + // Replace all dims for consistency + constexpr int MaxTensorRank = 5; + cute::array prob_shape_A = {1,1,1,1,1}; + cute::array prob_stride_A = {0,0,0,0,0}; + cute::array prob_shape_B = {1,1,1,1,1}; + cute::array prob_stride_B = {0,0,0,0,0}; InternalElementA const* ptr_A = nullptr; Tensor tensor_a = make_tensor(ptr_A, make_shape(M,K,Int<1>{}), mainloop_params.dA[next_group]); @@ -716,10 +700,10 @@ struct CollectiveMma< stride = (stride * sizeof_bits_v) / 8; } - cute::tma_descriptor_replace_dims_strides_in_shared_mem(shared_tensormap.smem_tensormap_A, + cute::tma_descriptor_replace_dims_strides_in_shared_mem(shared_tensormaps.smem_tensormap_A, prob_shape_A, prob_stride_A); - cute::tma_descriptor_replace_dims_strides_in_shared_mem(shared_tensormap.smem_tensormap_B, + cute::tma_descriptor_replace_dims_strides_in_shared_mem(shared_tensormaps.smem_tensormap_B, prob_shape_B, prob_stride_B); } @@ -728,21 +712,18 @@ struct CollectiveMma< CUTLASS_DEVICE void tensormaps_perform_update( - TensorMapStorage& shared_tensormap, + TensorMapStorage& shared_tensormaps, Params const& mainloop_params, cute::tuple const& input_tensormaps, ProblemShape_MNKL problem_shape_mnkl, int32_t next_batch) { if (cute::elect_one_sync()) { - // Bringing tensormaps to smem - tensormaps_fetch_to_smem(shared_tensormap, input_tensormaps); - // Replacing global_address for the next batch - tensormaps_replace_global_address(shared_tensormap, mainloop_params, next_batch); + tensormaps_replace_global_address(shared_tensormaps, mainloop_params, next_batch); if constexpr (IsGroupedGemmKernel) { // Replacing global dims and strides for the next batch - tensormaps_replace_global_tensor_properties(shared_tensormap, + tensormaps_replace_global_tensor_properties(shared_tensormaps, mainloop_params, next_batch, problem_shape_mnkl); } } @@ -752,11 +733,11 @@ struct CollectiveMma< CUTLASS_DEVICE void tensormaps_cp_fence_release ( - TensorMapStorage& shared_tensormap, + TensorMapStorage& shared_tensormaps, 
cute::tuple const& input_tensormaps) { // Entire warp must do this (i.e. it's aligned) - tma_descriptor_cp_fence_release(get<0>(input_tensormaps), shared_tensormap.smem_tensormap_A); - tma_descriptor_cp_fence_release(get<1>(input_tensormaps), shared_tensormap.smem_tensormap_B); + tma_descriptor_cp_fence_release(get<0>(input_tensormaps), shared_tensormaps.smem_tensormap_A); + tma_descriptor_cp_fence_release(get<1>(input_tensormaps), shared_tensormaps.smem_tensormap_B); } // The entire warp must call this function collectively (that is, the instructions are aligned) diff --git a/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_rs_warpspecialized.hpp b/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_rs_warpspecialized.hpp index b00a85881f..bdaea1d90d 100644 --- a/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_rs_warpspecialized.hpp +++ b/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_rs_warpspecialized.hpp @@ -187,7 +187,7 @@ struct CollectiveMma< struct SharedStorage { - struct TensorStorage : cute::aligned_struct<256> { + struct TensorStorage : cute::aligned_struct<256, _0> { cute::array_aligned, 256> smem_A; cute::array_aligned, 256> smem_B; } tensors; diff --git a/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_ss_warpspecialized.hpp b/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_ss_warpspecialized.hpp index 1afe7e956e..bea0d3d8d0 100644 --- a/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_ss_warpspecialized.hpp +++ b/include/cutlass/gemm/collective/sm90_mma_multistage_gmma_ss_warpspecialized.hpp @@ -135,7 +135,7 @@ struct CollectiveMma< struct SharedStorage { - struct TensorStorage : cute::aligned_struct<128> { + struct TensorStorage : cute::aligned_struct<128, _0> { cute::array_aligned> smem_A; cute::array_aligned> smem_B; } tensors; diff --git a/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized.hpp b/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized.hpp index 888d27b0c5..202a66e709 100644 --- a/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized.hpp +++ b/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized.hpp @@ -213,7 +213,7 @@ struct CollectiveMma< struct SharedStorage { - struct TensorStorage : cute::aligned_struct { + struct TensorStorage : cute::aligned_struct { cute::array_aligned, SmemAlignmentA> smem_A; cute::array_aligned, SmemAlignmentB> smem_B; } tensors; diff --git a/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp b/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp index c0dc1c26fa..a6cc0783d1 100644 --- a/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp +++ b/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp @@ -174,7 +174,7 @@ struct CollectiveMma< using SmemCopyAtomA = SmemCopyAtomA_; using SmemCopyAtomB = SmemCopyAtomB_; - using SmemCopyAtomScale = Copy_Atom; + using SmemCopyAtomScale = Copy_Atom; // We must ensure the type to be scaled goes to RF static constexpr bool SwapAB = !IsATransformed; @@ -182,6 +182,7 @@ struct CollectiveMma< using InternalSmemLayoutAtomB = cute::conditional_t; using InternalSmemCopyAtomA = cute::conditional_t; using InternalSmemCopyAtomB = cute::conditional_t; + // TMA converts f32 input to tf32 when copying from GMEM to SMEM // For all other types, cast to size equivalent uint type to avoid any rounding by TMA. 
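// ---[ Editor's sketch: TMA element-type selection ]---------------------------
// A standalone restatement of the rule in the comment above (assumed cute/
// cutlass spellings; `Element` and `TmaViewOf` are placeholders, not names
// from this diff): f32 is rounded to tf32 by TMA itself, while every other
// element type is viewed as a same-width unsigned integer so the copy stays
// bit-exact.
template <class Element>
using TmaViewOf = cute::conditional_t<
    cute::is_same_v<float, Element>,
    cutlass::tfloat32_t,                              // TMA rounds f32 -> tf32
    cute::uint_bit_t<cute::sizeof_bits_v<Element>>>;  // bit-exact passthrough
// ---[ end sketch ]-------------------------------------------------------------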
static constexpr bool ConvertF32toTF32A = cute::is_same_v; @@ -202,6 +203,7 @@ struct CollectiveMma< static constexpr int IsSubbyteA = cute::sizeof_bits_v < 8; using TmaElementA = cute::conditional_t; + using TmaElementScale = uint_bit_t >; // In case we have an array, translate to uint to satisfy the TMA descriptor's specialization using ArchTag = typename DispatchPolicy::ArchTag; @@ -227,14 +229,25 @@ struct CollectiveMma< static_assert((size<2>(TileShape{}) % size<1>(SmemLayoutAtomScale{})) == 0, "SmemLayoutAtomScale must evenly divide tile k shape."); // Tile along modes in a way that maximizes the TMA box size. - using SmemLayoutA = decltype(tile_to_shape( - InternalSmemLayoutAtomA{}, - make_shape(shape<0>(TileShape{}), shape<2>(TileShape{}), Int{}), - cute::conditional_t< ::cutlass::gemm::detail::is_major<0,InternalStrideA>(), Step<_2,_1,_3>, Step<_1,_2,_3>>{})); - using SmemLayoutB = decltype(tile_to_shape( - InternalSmemLayoutAtomB{}, - make_shape(shape<1>(TileShape{}), shape<2>(TileShape{}), Int{}), - cute::conditional_t< ::cutlass::gemm::detail::is_major<0,InternalStrideB>(), Step<_2,_1,_3>, Step<_1,_2,_3>>{})); + + template + static constexpr + CUTLASS_HOST_DEVICE + auto get_smem_layout(LayoutAtom layout_atom, TileShape const& tile_shape, Stride const& stride) { + if constexpr (not cute::is_layout::value) { + return tile_to_shape( + layout_atom, + append(tile_shape, Int{}), + cute::conditional_t< ::cutlass::gemm::detail::is_major<0,Stride>(), Step<_2,_1,_3>, Step<_1,_2,_3>>{}); + } + else { + auto gmem_tile = composition(stride, tile_shape); + return make_layout_like(append(gmem_tile, make_layout(Int{}, 0))); + } + } + + using SmemLayoutA = decltype(get_smem_layout(InternalSmemLayoutAtomA{}, select<0,2>(TileShape{}), InternalStrideA{})); + using SmemLayoutB = decltype(get_smem_layout(InternalSmemLayoutAtomB{}, select<1,2>(TileShape{}), InternalStrideB{})); // It is assumed that the scales and zero-points share the same smem layout using SmemLayoutScale = decltype(tile_to_shape( @@ -273,6 +286,8 @@ struct CollectiveMma< static constexpr ConversionMode KernelConversionMode = get_conversion_mode(); static constexpr bool ModeHasScales = KernelConversionMode == ConversionMode::ConvertAndScale || KernelConversionMode == ConversionMode::ConvertAndScaleWithZero; + static constexpr bool UseScaleLookupTable = KernelConversionMode == ConversionMode::ConvertAndScale && + cutlass::detail::is_Array_v; static constexpr auto elements_per_smem_scale() { @@ -304,22 +319,30 @@ struct CollectiveMma< // These methods use some of the public members of the class. For that reason, we define them after the public section.
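// ---[ Editor's sketch: per-stage TMA transaction bytes ]-----------------------
// The methods that follow count the bytes one pipeline stage expects from TMA.
// A hypothetical scalar equivalent (placeholder names): a (BLK_M, BLK_K) smem
// tile of `bits`-wide elements contributes ceil(BLK_M * BLK_K * bits / 8)
// bytes, and the scale/zero tensors add their own "extra" term on top of the
// A and B baselines.
#include <cstdint>
constexpr uint32_t stage_tx_bytes(uint32_t blk_m, uint32_t blk_k, uint32_t bits) {
  return (blk_m * blk_k * bits + 7) / 8;  // same arithmetic as cutlass::bits_to_bytes
}
static_assert(stage_tx_bytes(128, 64, 16) == 16384, "128x64 f16 stage: 16 KiB");
static_assert(stage_tx_bytes(128, 64, 4)  ==  4096, "128x64 int4 stage: 4 KiB");
// ---[ end sketch ]--------------------------------------------------------------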
static constexpr uint32_t compute_tma_transaction_bytes_mk() { - constexpr uint32_t baseline_bytes = cutlass::bits_to_bytes(size<0>(SmemLayoutA{}) * size<1>(SmemLayoutA{}) * static_cast(cute::sizeof_bits_v)); + return cutlass::bits_to_bytes(size<0>(SmemLayoutA{}) * size<1>(SmemLayoutA{}) * static_cast(cute::sizeof_bits_v)); + } + + static constexpr uint32_t + compute_tma_transaction_bytes_nk() { + return cutlass::bits_to_bytes(size<0>(SmemLayoutB{}) * size<1>(SmemLayoutB{}) * static_cast(cute::sizeof_bits_v)); + } + static constexpr uint32_t + compute_tma_transaction_bytes_extra() { if constexpr (KernelConversionMode == ConversionMode::DirectConvert) { - return baseline_bytes; + return 0; } else if constexpr (ModeHasScales) { constexpr uint32_t scale_tx_bytes = cutlass::bits_to_bytes(size<0>(SmemLayoutScale{}) * size<1>(SmemLayoutScale{}) * static_cast(cute::sizeof_bits_v)); static_assert(scale_tx_bytes % 128 == 0, "Each scale stage must be 128B aligned."); // required by TMA if constexpr (KernelConversionMode == ConversionMode::ConvertAndScale) { - return baseline_bytes + scale_tx_bytes; + return scale_tx_bytes; } else if constexpr (KernelConversionMode == ConversionMode::ConvertAndScaleWithZero) { // Scale and zero share smem layout constexpr uint32_t zero_tx_bytes = cutlass::bits_to_bytes(size<0>(SmemLayoutScale{}) * size<1>(SmemLayoutScale{}) * static_cast(cute::sizeof_bits_v)); static_assert(zero_tx_bytes % 128 == 0, "Each zero stage must be 128B aligned."); // required by TMA - return baseline_bytes + scale_tx_bytes + zero_tx_bytes; + return scale_tx_bytes + zero_tx_bytes; } else { static_assert(cutlass::detail::dependent_false, "Type not handled in tma transaction bytes computation."); @@ -330,11 +353,6 @@ struct CollectiveMma< } } - static constexpr uint32_t - compute_tma_transaction_bytes_nk() { - return cutlass::bits_to_bytes(size<0>(SmemLayoutB{}) * size<1>(SmemLayoutB{}) * static_cast(cute::sizeof_bits_v)); - } - public: static constexpr size_t SmemAlignmentA = cutlass::detail::alignment_for_swizzle(SmemLayoutA{}); @@ -349,7 +367,7 @@ struct CollectiveMma< { static constexpr int scale_elements = elements_per_smem_scale(); static constexpr int zero_elements = elements_per_smem_zero(); - struct TensorStorage : cute::aligned_struct { + struct TensorStorage : cute::aligned_struct { cute::ArrayEngine> smem_A; cute::ArrayEngine> smem_B; cute::ArrayEngine smem_scale; @@ -375,6 +393,18 @@ struct CollectiveMma< uint32_t mma_promotion_interval = 4; }; + template + static constexpr + CUTLASS_HOST_DEVICE + auto get_gmem_layout(Shape const& shape, Stride const& stride) { + if constexpr (not cute::is_layout::value) { + return make_layout(shape, stride); + } + else { + return stride; + } + } + // Device side kernel params struct Params { private: @@ -388,15 +418,19 @@ struct CollectiveMma< TransformB_>; public: + // Assumption: StrideA is congruent with Problem_MK - using TMA_A = decltype(make_tma_copy( + using LayoutA = decltype(get_gmem_layout(repeat_like(InternalStrideA{}, int32_t(0)), InternalStrideA{})); + using LayoutB = decltype(get_gmem_layout(repeat_like(InternalStrideB{}, int32_t(0)), InternalStrideB{})); + + using TMA_A = decltype(make_tma_copy_A_sm90( GmemTiledCopyA{}, - make_tensor(Outer::get_logical_ptr(static_cast(nullptr)), repeat_like(InternalStrideA{}, int32_t(0)), InternalStrideA{}), + make_tensor(Outer::get_logical_ptr(static_cast(nullptr)), LayoutA{}), SmemLayoutA{}(_,_,cute::Int<0>{}), - make_shape(shape<0>(TileShape{}), shape<2>(TileShape{})), - 
size<1>(ClusterShape{}))); // mcast along N mode for this M load, if any + TileShape{}, + ClusterShape{})); // mcast along N mode for this M load, if any - using TMA_Scale = decltype(make_tma_copy( + using TMA_Scale = decltype(make_tma_copy( GmemTiledCopyScale{}, make_tensor(Outer::get_logical_ptr(static_cast(nullptr)), repeat_like(NonVoidStrideScale{}, int32_t(0)), NonVoidStrideScale{}), SmemLayoutScale{}(_,_,cute::Int<0>{}), @@ -411,12 +445,12 @@ struct CollectiveMma< _1{})); // mcast along N mode for this M load, if any. Scale is ALWAYS loaded with A for RF kernel // Assumption: StrideB is congruent with Problem_NK - using TMA_B = decltype(make_tma_copy( + using TMA_B = decltype(make_tma_copy_B_sm90( GmemTiledCopyB{}, - make_tensor(Outer::get_logical_ptr(static_cast(nullptr)), repeat_like(InternalStrideB{}, int32_t(0)), InternalStrideB{}), + make_tensor(Outer::get_logical_ptr(static_cast(nullptr)), LayoutB{}), SmemLayoutB{}(_,_,cute::Int<0>{}), - make_shape(shape<1>(TileShape{}), shape<2>(TileShape{})), - size<0>(ClusterShape{}))); // mcast along M mode for this N load, if any + TileShape{}, + ClusterShape{})); // mcast along M mode for this N load, if any TMA_A tma_load_a; TMA_B tma_load_b; TMA_Scale tma_load_scale; @@ -424,8 +458,9 @@ struct CollectiveMma< int64_t scale_k; int group_size; uint32_t tma_transaction_bytes = TmaTransactionBytes; - uint32_t tma_transaction_bytes_mk = TmaTransactionBytesMK; - uint32_t tma_transaction_bytes_nk = TmaTransactionBytesNK; + int reload_factor = (group_size + size<2>(TileShape{}) - 1) / size<2>(TileShape{}); + InternalStrideA dA; + InternalStrideB dB; }; // @@ -464,33 +499,35 @@ struct CollectiveMma< dB = args.dA; } - Tensor tensor_a = make_tensor(get_logical_ptr(ptr_A), make_layout(make_shape(M,K,L), dA)); - Tensor tensor_b = make_tensor(get_logical_ptr(ptr_B), make_layout(make_shape(N,K,L), dB)); - typename Params::TMA_A tma_load_a = make_tma_copy( + Tensor tensor_a = make_tensor(get_logical_ptr(ptr_A), get_gmem_layout(make_shape(M,K,L), dA)); + Tensor tensor_b = make_tensor(get_logical_ptr(ptr_B), get_gmem_layout(make_shape(N,K,L), dB)); + typename Params::TMA_A tma_load_a = make_tma_copy_A_sm90( GmemTiledCopyA{}, tensor_a, SmemLayoutA{}(_,_,cute::Int<0>{}), - make_shape(shape<0>(TileShape{}), shape<2>(TileShape{})), - size<1>(ClusterShape{})); // mcast along N mode for this M load, if any + TileShape{}, + ClusterShape{}); // mcast along N mode for this M load, if any - typename Params::TMA_B tma_load_b = make_tma_copy( + typename Params::TMA_B tma_load_b = make_tma_copy_B_sm90( GmemTiledCopyB{}, tensor_b, SmemLayoutB{}(_,_,cute::Int<0>{}), - make_shape(shape<1>(TileShape{}), shape<2>(TileShape{})), - size<0>(ClusterShape{})); // mcast along M mode for this N load, if any + TileShape{}, + ClusterShape{}); // mcast along M mode for this N load, if any + + typename Params::TMA_Scale tma_load_scale{}; + typename Params::TMA_Zero tma_load_zero{}; - typename Params::TMA_Scale tma_load_scale; - typename Params::TMA_Zero tma_load_zero; + uint32_t tma_transaction_bytes = TmaTransactionBytesMK + TmaTransactionBytesNK; if constexpr (KernelConversionMode == ConversionMode::DirectConvert) { - return { tma_load_a, tma_load_b, tma_load_scale, tma_load_zero, 0, 0, TmaTransactionBytes, TmaTransactionBytesMK, TmaTransactionBytesNK }; + return { tma_load_a, tma_load_b, tma_load_scale, tma_load_zero, 0, 0, tma_transaction_bytes, 1, dA, dB }; } else if constexpr (ModeHasScales) { auto scale_k = (K + args.group_size - 1) / args.group_size; ElementScale const* ptr_S 
= args.ptr_S; StrideScale dS = args.dS; Tensor tensor_scale = make_tensor(get_logical_ptr(ptr_S), make_layout(make_shape(M,scale_k,L), dS)); - tma_load_scale = make_tma_copy( + tma_load_scale = make_tma_copy( GmemTiledCopyScale{}, tensor_scale, SmemLayoutScale{}(_,_,cute::Int<0>{}), @@ -498,7 +535,7 @@ struct CollectiveMma< _1{}); // mcast along N mode for this M load, if any if constexpr(KernelConversionMode == ConversionMode::ConvertAndScale) { - return { tma_load_a, tma_load_b, tma_load_scale, tma_load_zero, scale_k, args.group_size, TmaTransactionBytes, TmaTransactionBytesMK, TmaTransactionBytesNK }; + return { tma_load_a, tma_load_b, tma_load_scale, tma_load_zero, scale_k, args.group_size, tma_transaction_bytes + TmaTransactionBytesExtra, (args.group_size + size<2>(TileShape{}) - 1) / size<2>(TileShape{}), dA, dB }; } else if constexpr(KernelConversionMode == ConversionMode::ConvertAndScaleWithZero) { Tensor tensor_zero = make_tensor(get_logical_ptr(args.ptr_Z), make_layout(make_shape(M,scale_k,L), dS)); @@ -508,7 +545,7 @@ struct CollectiveMma< SmemLayoutScale{}(_,_,cute::Int<0>{}), ScaleTileShape{}, _1{}); // mcast along N mode for this M load, if any - return { tma_load_a, tma_load_b, tma_load_scale, tma_load_zero, scale_k, args.group_size, TmaTransactionBytes, TmaTransactionBytesMK, TmaTransactionBytesNK }; + return { tma_load_a, tma_load_b, tma_load_scale, tma_load_zero, scale_k, args.group_size, tma_transaction_bytes + TmaTransactionBytesExtra, (args.group_size + size<2>(TileShape{}) - 1) / size<2>(TileShape{}), dA, dB }; } else { static_assert(cutlass::detail::dependent_false, "Conversion mode not handled in to_underlying_arguments."); } @@ -526,33 +563,37 @@ struct CollectiveMma< constexpr int tma_alignment_bits = 128; auto problem_shape_MNKL = append<4>(problem_shape, 1); auto [M,N,K,L] = problem_shape_MNKL; - - bool implementable = true; + constexpr int min_tma_aligned_elements_A = tma_alignment_bits / cutlass::sizeof_bits::value; - implementable = implementable && cutlass::detail::check_alignment(cute::make_shape(M,K,L), StrideA{}); + bool check_aligned_A = cutlass::detail::check_alignment(get_gmem_layout(cute::make_shape(M,K,L), args.dA)); + constexpr int min_tma_aligned_elements_B = tma_alignment_bits / cutlass::sizeof_bits::value; - implementable = implementable && cutlass::detail::check_alignment(cute::make_shape(N,K,L), StrideB{}); + bool check_aligned_B = cutlass::detail::check_alignment(get_gmem_layout(cute::make_shape(N,K,L), args.dB)); + + bool check_aligned_S = true; + bool check_aligned_Z = true; + bool check_mode_args = true; if constexpr (KernelConversionMode == ConversionMode::DirectConvert) { - implementable = implementable && (args.ptr_S == nullptr); - implementable = implementable && (args.ptr_Z == nullptr); + check_mode_args = check_mode_args && (args.ptr_S == nullptr); + check_mode_args = check_mode_args && (args.ptr_Z == nullptr); } else if constexpr (ModeHasScales) { const int scale_mn = SwapAB ? 
N : M; const int scale_k = (K + args.group_size - 1) / args.group_size; constexpr int min_tma_aligned_elements_scale = tma_alignment_bits / cutlass::sizeof_bits::value; - implementable = implementable && cutlass::detail::check_alignment(cute::make_shape(scale_mn,scale_k,L), StrideScale{}); - implementable = implementable && (args.group_size == K || ((args.group_size % size<2>(TileShape{})) == 0)); - implementable = implementable && args.group_size != 0; - implementable = implementable && (args.ptr_S != nullptr); + check_aligned_S = cutlass::detail::check_alignment(cute::make_shape(scale_mn,scale_k,L), args.dS); + check_mode_args = check_mode_args && (args.group_size == K || ((args.group_size % size<2>(TileShape{})) == 0)); + check_mode_args = check_mode_args && args.group_size != 0; + check_mode_args = check_mode_args && (args.ptr_S != nullptr); if constexpr (KernelConversionMode == ConversionMode::ConvertAndScale) { - implementable = implementable && (args.ptr_Z == nullptr); + check_mode_args = check_mode_args && (args.ptr_Z == nullptr); } else if constexpr (KernelConversionMode == ConversionMode::ConvertAndScaleWithZero) { constexpr int min_tma_aligned_elements_zero = tma_alignment_bits / cutlass::sizeof_bits::value; - implementable = implementable && cutlass::detail::check_alignment(cute::make_shape(scale_mn,scale_k,L), StrideScale{}); - implementable = implementable && (args.ptr_Z != nullptr); + check_aligned_Z = cutlass::detail::check_alignment(cute::make_shape(scale_mn,scale_k,L), args.dS); + check_mode_args = check_mode_args && (args.ptr_Z != nullptr); } else { static_assert(cutlass::detail::dependent_false, "Conversion mode not handled in can_implement."); @@ -562,16 +603,30 @@ struct CollectiveMma< static_assert(cutlass::detail::dependent_false, "Conversion mode not handled in can_implement."); } - if (!implementable) { - CUTLASS_TRACE_HOST(" CAN IMPLEMENT: Problem Size doesn't meet the minimum alignment requirements for TMA.\n"); + if (!check_mode_args) { + CUTLASS_TRACE_HOST(" CAN IMPLEMENT: Invalid arguments for the selected conversion mode.\n"); + } + if (!check_aligned_A) { + CUTLASS_TRACE_HOST(" CAN IMPLEMENT: Tensor A doesn't meet the minimum alignment requirements for TMA.\n"); + } + if (!check_aligned_B) { + CUTLASS_TRACE_HOST(" CAN IMPLEMENT: Tensor B doesn't meet the minimum alignment requirements for TMA.\n"); } - return implementable; + if (!check_aligned_S) { + CUTLASS_TRACE_HOST(" CAN IMPLEMENT: Tensor S (scale) doesn't meet the minimum alignment requirements for TMA.\n"); + } + if (!check_aligned_Z) { + CUTLASS_TRACE_HOST(" CAN IMPLEMENT: Tensor Z (zeros) doesn't meet the minimum alignment requirements for TMA.\n"); + } + + return check_mode_args && check_aligned_A && check_aligned_B && check_aligned_S && check_aligned_Z; } static constexpr int K_PIPE_MAX = DispatchPolicy::Stages; static constexpr uint32_t TmaTransactionBytesMK = compute_tma_transaction_bytes_mk(); static constexpr uint32_t TmaTransactionBytesNK = compute_tma_transaction_bytes_nk(); - static constexpr uint32_t TmaTransactionBytes = TmaTransactionBytesMK + TmaTransactionBytesNK; + static constexpr uint32_t TmaTransactionBytesExtra = compute_tma_transaction_bytes_extra(); + static constexpr uint32_t TmaTransactionBytes = TmaTransactionBytesMK + TmaTransactionBytesNK + TmaTransactionBytesExtra; /// Issue Tma Descriptor Prefetch -- ideally from a single thread for best performance CUTLASS_DEVICE @@ -610,8 +665,8 @@ struct CollectiveMma< // TMA requires special handling of strides to deal with coord codomain mapping // Represent the
full tensors -- get these from TMA - Tensor mA_mkl = mainloop_params.tma_load_a.get_tma_tensor(make_shape(M,K,L)); // (m,k,l) - Tensor mB_nkl = mainloop_params.tma_load_b.get_tma_tensor(make_shape(N,K,L)); // (n,k,l) + Tensor mA_mkl = mainloop_params.tma_load_a.get_tma_tensor(shape(get_gmem_layout(make_shape(M,K,L), mainloop_params.dA))); // (m,k,l) + Tensor mB_nkl = mainloop_params.tma_load_b.get_tma_tensor(shape(get_gmem_layout(make_shape(N,K,L), mainloop_params.dB))); // (n,k,l) // Make tiled views, defer the slice Tensor gA_mkl = local_tile(mA_mkl, TileShape{}, make_coord(_,_,_), Step<_1, X,_1>{}); // (BLK_M,BLK_K,m,k,l) @@ -672,124 +727,119 @@ struct CollectiveMma< static_assert(cutlass::detail::dependent_false, "Conversion mode not handled in TMA load."); } - int lane_predicate = cute::elect_one_sync(); + Tensor sA_ = make_tensor(make_smem_ptr(shared_tensors.smem_A.begin()), SmemLayoutA{}); // (BLK_M,BLK_K,PIPE) + Tensor sB_ = make_tensor(make_smem_ptr(shared_tensors.smem_B.begin()), SmemLayoutB{}); // (BLK_N,BLK_K,PIPE) + Tensor sA = as_position_independent_swizzle_tensor(sA_); // (BLK_M,BLK_K,PIPE) + Tensor sB = as_position_independent_swizzle_tensor(sB_); // (BLK_N,BLK_K,PIPE) - if (lane_predicate) { - Tensor sA_ = make_tensor(make_smem_ptr(shared_tensors.smem_A.begin()), SmemLayoutA{}); // (BLK_M,BLK_K,PIPE) - Tensor sB_ = make_tensor(make_smem_ptr(shared_tensors.smem_B.begin()), SmemLayoutB{}); // (BLK_N,BLK_K,PIPE) - Tensor sA = as_position_independent_swizzle_tensor(sA_); // (BLK_M,BLK_K,PIPE) - Tensor sB = as_position_independent_swizzle_tensor(sB_); // (BLK_N,BLK_K,PIPE) - - // - // Prepare the TMA loads for A, B and Scales - // - - constexpr uint32_t cluster_shape_x = get<0>(ClusterShape()); - uint2 cluster_local_block_id = {block_rank_in_cluster % cluster_shape_x, block_rank_in_cluster / cluster_shape_x}; + // + // Prepare the TMA loads for A, B and Scales + // + + constexpr uint32_t cluster_shape_x = get<0>(ClusterShape()); + uint2 cluster_local_block_id = {block_rank_in_cluster % cluster_shape_x, block_rank_in_cluster / cluster_shape_x}; - Tensor gA_mkl = get<0>(load_inputs); - Tensor gB_nkl = get<1>(load_inputs); + Tensor gA_mkl = get<0>(load_inputs); + Tensor gB_nkl = get<1>(load_inputs); - auto block_tma_a = mainloop_params.tma_load_a.get_slice(cluster_local_block_id.y); - auto block_tma_b = mainloop_params.tma_load_b.get_slice(cluster_local_block_id.x); + auto block_tma_a = mainloop_params.tma_load_a.get_slice(cluster_local_block_id.y); + auto block_tma_b = mainloop_params.tma_load_b.get_slice(cluster_local_block_id.x); - // Partition the inputs based on the current block coordinates. - auto [m_coord, n_coord, k_coord, l_coord] = blk_coord; - Tensor gA = gA_mkl(_,_,m_coord,_,l_coord); // (BLK_M,BLK_K,k) - Tensor gB = gB_nkl(_,_,n_coord,_,l_coord); // (BLK_N,BLK_K,k) + // Partition the inputs based on the current block coordinates. 
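// ---[ Editor's sketch: slicing the tiled views by block coordinate ]-----------
// gA_mkl above has profile (BLK_M, BLK_K, m, k, l); fixing this block's m and
// l coordinates leaves a (BLK_M, BLK_K, k) slab whose k extent is the number
// of k-tiles the CTA streams. Illustrative tile counts (placeholder sizes):
constexpr int tiles(int extent, int tile) { return (extent + tile - 1) / tile; }
static_assert(tiles(512, 128) == 4 && tiles(1024, 64) == 16,
              "a 512x1024 A with 128x64 CTA tiles: 4 m-blocks, 16 k-tiles each");
// ---[ end sketch ]--------------------------------------------------------------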
+ auto [m_coord, n_coord, k_coord, l_coord] = blk_coord; + Tensor gA = gA_mkl(_,_,m_coord,_,l_coord); // (BLK_M,BLK_K,k) + Tensor gB = gB_nkl(_,_,n_coord,_,l_coord); // (BLK_N,BLK_K,k) - // Applies the mapping from block_tma_a - Tensor tAgA = block_tma_a.partition_S(gA); // (TMA,TMA_M,TMA_K,k) - Tensor tAsA = block_tma_a.partition_D(sA); // (TMA,TMA_M,TMA_K,PIPE) + // Applies the mapping from block_tma_a + Tensor tAgA = block_tma_a.partition_S(gA); // (TMA,TMA_M,TMA_K,k) + Tensor tAsA = block_tma_a.partition_D(sA); // (TMA,TMA_M,TMA_K,PIPE) - Tensor tBgB = block_tma_b.partition_S(gB); // (TMA,TMA_N,TMA_K,k) - Tensor tBsB = block_tma_b.partition_D(sB); // (TMA,TMA_N,TMA_K,PIPE) + Tensor tBgB = block_tma_b.partition_S(gB); // (TMA,TMA_N,TMA_K,k) + Tensor tBsB = block_tma_b.partition_D(sB); // (TMA,TMA_N,TMA_K,PIPE) - uint16_t mcast_mask_a = 0; - uint16_t mcast_mask_b = 0; - uint16_t mcast_mask_s = 0; + uint16_t mcast_mask_a = 0; + uint16_t mcast_mask_b = 0; + uint16_t mcast_mask_s = 0; - // Issue TmaLoads - // Maps the tile -> block, value - if constexpr (cute::is_same_v) { - auto block_layout = Layout{}; // (m,n) -> block_id - for (int n = 0; n < size<1>(block_layout); ++n) { - mcast_mask_a |= (uint16_t(1) << block_layout(cluster_local_block_id.x,n,Int<0>{})); - } + // Issue TmaLoads + // Maps the tile -> block, value + if constexpr (cute::is_same_v) { + auto block_layout = Layout{}; // (m,n) -> block_id + for (int n = 0; n < size<1>(block_layout); ++n) { + mcast_mask_a |= (uint16_t(1) << block_layout(cluster_local_block_id.x,n,Int<0>{})); } + } - if constexpr (cute::is_same_v) { - auto block_layout = Layout{}; // (m,n) -> block_id - for (int m = 0; m < size<0>(block_layout); ++m) { - mcast_mask_b |= (uint16_t(1) << block_layout(m,cluster_local_block_id.y,Int<0>{})); - } + if constexpr (cute::is_same_v) { + auto block_layout = Layout{}; // (m,n) -> block_id + for (int m = 0; m < size<0>(block_layout); ++m) { + mcast_mask_b |= (uint16_t(1) << block_layout(m,cluster_local_block_id.y,Int<0>{})); } + } - auto extra_input_partitions = partition_extra_tma_inputs(mainloop_params, load_inputs, shared_tensors, cluster_local_block_id, m_coord, l_coord); + auto extra_input_partitions = partition_extra_tma_inputs(mainloop_params, load_inputs, shared_tensors, cluster_local_block_id, m_coord, l_coord); - // Mainloop - CUTLASS_PRAGMA_NO_UNROLL - for ( ; k_tile_count > 0; --k_tile_count) { - // LOCK smem_pipe_write for _writing_ - pipeline.producer_acquire(smem_pipe_write); + // Mainloop + CUTLASS_PRAGMA_NO_UNROLL + for ( ; k_tile_count > 0; --k_tile_count) { + // LOCK smem_pipe_write for _writing_ + pipeline.producer_acquire(smem_pipe_write); - // - // Copy gmem to smem for *k_tile_iter - // + // + // Copy gmem to smem for *k_tile_iter + // - using BarrierType = typename MainloopPipeline::ProducerBarrierType; - BarrierType* tma_barrier = pipeline.producer_get_barrier(smem_pipe_write); + using BarrierType = typename MainloopPipeline::ProducerBarrierType; + BarrierType* tma_barrier = pipeline.producer_get_barrier(smem_pipe_write); - int write_stage = smem_pipe_write.index(); + int write_stage = smem_pipe_write.index(); + if (cute::elect_one_sync()) { copy(mainloop_params.tma_load_a.with(*tma_barrier, mcast_mask_a), tAgA(_,_,_,*k_tile_iter), tAsA(_,_,_,write_stage)); copy(mainloop_params.tma_load_b.with(*tma_barrier, mcast_mask_b), tBgB(_,_,_,*k_tile_iter), tBsB(_,_,_,write_stage)); + } - if constexpr (KernelConversionMode == ConversionMode::DirectConvert) { - // Nothing extra to do. 
- } - else if constexpr (ModeHasScales) { - auto tSgS = get<0>(extra_input_partitions); - auto tSsS = get<1>(extra_input_partitions); - - // Temporary factor which will determine which k tile to reload from gmem. Needed so we don't modify tma transaction bytes - // on the fly. - // We must do a ceiling divide here to correctly handle with group_size == K. In that case, we don't require that K - // is a multiple of the threadblock tile K - const int ReloadFactor = (mainloop_params.group_size + size<2>(TileShape{}) - 1) / size<2>(TileShape{}); - const int scale_load_k = *k_tile_iter / ReloadFactor; // This will always be 0 when group_size == K. - copy(mainloop_params.tma_load_scale.with(*tma_barrier, mcast_mask_s), tSgS(_,_,_,scale_load_k), tSsS(_,_,_,write_stage)); - - if constexpr (KernelConversionMode == ConversionMode::ConvertAndScale) { - // Nothing extra to do - } - else if constexpr (KernelConversionMode == ConversionMode::ConvertAndScaleWithZero) { - auto tZgZ = get<2>(extra_input_partitions); - auto tZsZ = get<3>(extra_input_partitions); - copy(mainloop_params.tma_load_zero.with(*tma_barrier, mcast_mask_s), tZgZ(_,_,_,scale_load_k), tZsZ(_,_,_,write_stage)); - } - else { - static_assert(cutlass::detail::dependent_false, "Conversion mode not handled for TMA copy op."); - } + if constexpr (KernelConversionMode == ConversionMode::DirectConvert) { + // Nothing extra to do. + } + else if constexpr (ModeHasScales) { + auto tSgS = get<0>(extra_input_partitions); + auto tSsS = get<1>(extra_input_partitions); + + // Temporary factor which will determine which k tile to reload from gmem. Needed so we don't modify tma transaction bytes + // on the fly. + // We must do a ceiling divide here to correctly handle group_size == K. In that case, we don't require that K + // is a multiple of the threadblock tile K + int const scale_load_k = *k_tile_iter / mainloop_params.reload_factor; // This will always be 0 when group_size == K.
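// ---[ Editor's sketch: scale reload cadence ]-----------------------------------
// reload_factor (now precomputed in Params) is ceil(group_size / TILE_K), so
// the division above advances the scale column once every reload_factor
// k-tiles and stays at column 0 when group_size == K. Placeholder numbers:
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }
static_assert(ceil_div(128, 64) == 2,      "group_size 128, TILE_K 64: reload every 2 k-tiles");
static_assert(5 / ceil_div(128, 64) == 2,  "k-tile 5 then reads scale column 2");
static_assert(5 / ceil_div(1024, 64) == 0, "group_size == K == 1024: always column 0");
// ---[ end sketch ]----------------------------------------------------------------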
+ if (cute::elect_one_sync()) copy(mainloop_params.tma_load_scale.with(*tma_barrier, mcast_mask_s), tSgS(_,_,_,scale_load_k), tSsS(_,_,_,write_stage)); + + if constexpr (KernelConversionMode == ConversionMode::ConvertAndScale) { + // Nothing extra to do } + else if constexpr (KernelConversionMode == ConversionMode::ConvertAndScaleWithZero) { + auto tZgZ = get<2>(extra_input_partitions); + auto tZsZ = get<3>(extra_input_partitions); + if (cute::elect_one_sync()) copy(mainloop_params.tma_load_zero.with(*tma_barrier, mcast_mask_s), tZgZ(_,_,_,scale_load_k), tZsZ(_,_,_,write_stage)); + } else { static_assert(cutlass::detail::dependent_false, "Conversion mode not handled for TMA copy op."); - } + } + } + else { + static_assert(cutlass::detail::dependent_false, "Conversion mode not handled for TMA copy op."); + } - ++k_tile_iter; + ++k_tile_iter; - // Advance smem_pipe_write - ++smem_pipe_write; - } + // Advance smem_pipe_write + ++smem_pipe_write; } } /// Perform a Producer Epilogue to prevent early exit of blocks in a Cluster CUTLASS_DEVICE void load_tail(MainloopPipeline pipeline, PipelineState smem_pipe_write) { - int lane_predicate = cute::elect_one_sync(); - // Issue the epilogue waits - if (lane_predicate) { + if (cute::elect_one_sync()) { /* This helps avoid early exit of blocks in Cluster * Waits for all stages to either be released (all * Consumer UNLOCKs), or if the stage was never used @@ -868,13 +918,6 @@ struct CollectiveMma< Tensor tCrA_copy_view = smem_thr_copy_A.retile_D(tCrA_load); // (CPY,CPY_M,CPY_K) - // Compute the max vector length that can be used to copy A. This will match the vector width of the - // conversions used. It helps by allowing the compiler to convert using the same register that was used - // to load the data from smem. This significantly reduces the need to move data among registers. - // Note that this is correct even if copy fails to vectorize, since the granularity at which we perform - // the conversion does not impact correctness. 
- using A_CPY_VEC = decltype(max_common_vector(tCsA, tCrA_copy_view)); - // Partition of thread -> shared and thread -> RF auto partitioned_extra_info = partition_extra_mma_info(mma_thread_slice, shared_tensors); auto copy_partitions_extra_info = retile_extra_mma_info(tiled_mma, partitioned_extra_info, warp_group_thread_idx); @@ -915,32 +958,42 @@ struct CollectiveMma< // copy smem->rmem for A operand copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, partitioned_extra_info, copy_partitions_extra_info, 0, read_stage); - - transform_A_kblock(tCrA_load, A_CPY_VEC{}, tCrA_mma, partitioned_extra_info, 0); + if (K_BLOCK_MAX > 1) { // prefetch next block + copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, + partitioned_extra_info, copy_partitions_extra_info, 1, read_stage); + } + transform_A_kblock(tCrA_load, tCrA_mma, partitioned_extra_info, 0); // Unroll the K mode manually to set scale D to 1 CUTLASS_PRAGMA_UNROLL for (int k_block = 0; k_block < K_BLOCK_MAX; ++k_block) { - if (k_block < K_BLOCK_MAX - 1) { - copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, - partitioned_extra_info, copy_partitions_extra_info, k_block + 1, read_stage); - transform_A_kblock(tCrA_load, A_CPY_VEC{}, tCrA_mma, partitioned_extra_info, k_block + 1); - } warpgroup_arrive(); // (V,M) x (V,N) => (V,M,N) cute::gemm(tiled_mma, tCrA_mma(_,_,k_block), tCrB(_,_,k_block,read_stage), accum); tiled_mma.accumulate_ = GMMA::ScaleOut::One; warpgroup_commit_batch(); + + if (k_block < K_BLOCK_MAX - 2) { // prefetch next block + copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, + partitioned_extra_info, copy_partitions_extra_info, k_block + 2, read_stage); + } + if (k_block < K_BLOCK_MAX - 1) { + transform_A_kblock(tCrA_load, tCrA_mma, partitioned_extra_info, k_block + 1); + } } --k_tile_count; if (k_tile_count > 0) { // Wait for K_BLOCK_MAX - 1 to be in flight to ensure that it is safe to overwrite the A registers for the first mma. 
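// ---[ Editor's sketch: the revised k-block software pipeline ]------------------
// A host-side trace of the schedule used in this mma (placeholder stage
// functions, no CUDA): copies now run two k-blocks ahead of the MMA and the
// transform runs one ahead, so each transform consumes a copy issued two
// steps earlier.
#include <cassert>
#include <vector>
int main() {
  constexpr int KB = 4;                     // stand-in for K_BLOCK_MAX
  std::vector<bool> copied(KB, false);
  auto copy_blk = [&](int k) { copied[k] = true; };
  copy_blk(0); copy_blk(1);                 // prologue: two k-blocks in flight
  for (int k = 0; k < KB; ++k) {            // cute::gemm(...) would issue here
    if (k + 2 < KB) copy_blk(k + 2);        // prefetch two k-blocks ahead
    if (k + 1 < KB) assert(copied[k + 1]);  // transform(k+1) input is ready
  }
  return 0;
}
// ---[ end sketch ]----------------------------------------------------------------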
- warpgroup_wait(); pipeline.consumer_wait(smem_pipe_read, barrier_token); copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, partitioned_extra_info, copy_partitions_extra_info, 0, smem_pipe_read.index()); - transform_A_kblock(tCrA_load, A_CPY_VEC{}, tCrA_mma, partitioned_extra_info, 0); + if (K_BLOCK_MAX > 1) { // prefetch next block + copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, + partitioned_extra_info, copy_partitions_extra_info, 1, smem_pipe_read.index()); + } + warpgroup_wait(); + transform_A_kblock(tCrA_load, tCrA_mma, partitioned_extra_info, 0); } } @@ -971,9 +1024,8 @@ struct CollectiveMma< tiled_mma.accumulate_ = GMMA::ScaleOut::One; warpgroup_commit_batch(); - warpgroup_wait(); + warpgroup_wait(); // We have K_BLOCK_MAX - 1 GMMA instructions pending for this stage, so we can release prior barrier if (k_block == K_BLOCK_MAX - 1) { - // We have K_BLOCK_MAX - 1 GMMA instructions pending for this stage, so we can release prior barrier pipeline.consumer_release(smem_pipe_release); // UNLOCK smem_pipe_release, done _computing_ on it ++smem_pipe_release; } @@ -986,12 +1038,18 @@ struct CollectiveMma< pipeline.consumer_wait(smem_pipe_read, barrier_token); copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, partitioned_extra_info, copy_partitions_extra_info, 0, smem_pipe_read.index()); - transform_A_kblock(tCrA_load, A_CPY_VEC{}, tCrA_mma, partitioned_extra_info, 0); + if (K_BLOCK_MAX > 1) { // prefetch next block + copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, + partitioned_extra_info, copy_partitions_extra_info, 1, smem_pipe_read.index()); + } + transform_A_kblock(tCrA_load, tCrA_mma, partitioned_extra_info, 0); } else { - copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, - partitioned_extra_info, copy_partitions_extra_info, k_block + 1, read_stage); - transform_A_kblock(tCrA_load, A_CPY_VEC{}, tCrA_mma, partitioned_extra_info, k_block + 1); + if (k_block < K_BLOCK_MAX - 2) { // prefetch next block + copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, + partitioned_extra_info, copy_partitions_extra_info, k_block + 2, read_stage); + } + transform_A_kblock(tCrA_load, tCrA_mma, partitioned_extra_info, k_block + 1); } } warpgroup_fence_operand(accum); @@ -1018,17 +1076,21 @@ struct CollectiveMma< cute::gemm(tiled_mma, tCrA_mma(_,_,k_block), tCrB(_,_,k_block,read_stage), accum); tiled_mma.accumulate_ = GMMA::ScaleOut::One; warpgroup_commit_batch(); + warpgroup_wait(); - if (k_block == K_BLOCK_MAX - 1) { - // release prior barrier + if (k_block == K_BLOCK_MAX - 1) { // release prior barrier pipeline.consumer_release(smem_pipe_release); // UNLOCK smem_pipe_release, done _computing_ on it ++smem_pipe_release; } + if (k_block < K_BLOCK_MAX - 2) { // prefetch next block + copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, + partitioned_extra_info, copy_partitions_extra_info, k_block + 2, read_stage); + } if (k_block < K_BLOCK_MAX - 1) { copy_A_and_extra_info(smem_tiled_copy_A, tCsA, tCrA_copy_view, partitioned_extra_info, copy_partitions_extra_info, k_block + 1, read_stage); - transform_A_kblock(tCrA_load, A_CPY_VEC{}, tCrA_mma, partitioned_extra_info, k_block + 1); + transform_A_kblock(tCrA_load, tCrA_mma, partitioned_extra_info, k_block + 1); } } } @@ -1110,10 +1172,20 @@ struct CollectiveMma< // nothing to do return cute::make_tuple(); } + else if constexpr (UseScaleLookupTable) { + Tensor sS = make_tensor(make_smem_ptr(shared_tensors.smem_scale.begin()), SmemLayoutScale{});// 
(BLK_M,BLK_SCALE_K,PIPE) + Tensor tCsS = mma_thread_slice.partition_A(sS); + Tensor tCrS_neg = make_tensor(mma_thread_slice.partition_fragment_A(sS(_,_,Int<0>{})).layout()); + Tensor tCrS_pos = make_tensor(mma_thread_slice.partition_fragment_A(sS(_,_,Int<0>{})).layout()); + + if constexpr (KernelConversionMode == ConversionMode::ConvertAndScale) { + return cute::make_tuple(tCsS, tCrS_neg, tCrS_pos); + } + } else if constexpr (ModeHasScales) { Tensor sS = make_tensor(make_smem_ptr(shared_tensors.smem_scale.begin()), SmemLayoutScale{});// (BLK_M,BLK_SCALE_K,PIPE) Tensor tCsS = mma_thread_slice.partition_A(sS); - Tensor tCrS = make_tensor(mma_thread_slice.partition_fragment_A(sS(_,_,Int<0>{})).shape()); + Tensor tCrS = make_tensor(mma_thread_slice.partition_fragment_A(sS(_,_,Int<0>{})).layout()); if constexpr (KernelConversionMode == ConversionMode::ConvertAndScale) { return cute::make_tuple(tCsS, tCrS); @@ -1121,7 +1193,7 @@ struct CollectiveMma< else if constexpr (KernelConversionMode == ConversionMode::ConvertAndScaleWithZero) { Tensor sZ = make_tensor(make_smem_ptr(shared_tensors.smem_zero.begin()), SmemLayoutScale{});// (BLK_M,BLK_SCALE_K,PIPE) Tensor tCsZ = mma_thread_slice.partition_A(sZ); - Tensor tCrZ = make_tensor(mma_thread_slice.partition_fragment_A(sZ(_,_,Int<0>{})).shape()); + Tensor tCrZ = make_tensor(mma_thread_slice.partition_fragment_A(sZ(_,_,Int<0>{})).layout()); return cute::make_tuple(tCsS, tCrS, tCsZ, tCrZ); } else { @@ -1210,159 +1282,275 @@ struct CollectiveMma< } } } + + // Helper functions to select packing for conversion + template + struct select_packing { // Naive packing policy + static constexpr auto value() { + return Int, sizeof_bits_v))>{}; + } + }; + + CUTLASS_DEVICE + static uint32_t to_reg(Array const& source) { + return static_cast( + reinterpret_cast(source)); + } + CUTLASS_DEVICE + static uint32_t to_reg(Array const& source) { + return reinterpret_cast(source); + } + // The core converter uses a lookup table to convert i4 -> 8-bit values.
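// ---[ Editor's sketch: scalar reference for the lookup conversion ]-------------
// My reading of the PTX that follows, restated as plain C++ (assumed semantics;
// names are placeholders): each int4 nibble selects a byte from one of two
// 8-entry tables -- the "negative" table when its sign bit is set, the
// "positive" table otherwise -- indexed by its low three bits.
#include <cstdint>
inline uint8_t lut_convert_nibble(uint8_t nib, uint8_t const neg[8], uint8_t const pos[8]) {
  bool is_negative = (nib & 0x8) != 0;          // sign bit of the int4 value
  return (is_negative ? neg : pos)[nib & 0x7];  // low 3 bits index the table
}
// ---[ end sketch ]----------------------------------------------------------------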
+ template + CUTLASS_DEVICE + static Array lookup_table_convert( + cute::Int _, + Array const& source, + TensorPos const& scale_neg, + TensorNeg const& scale_pos, + int scale_idx) { + + static_assert(N == 4 || N == 8); + uint32_t res[N / 4]; + + // View the input as reg + uint32_t reg = to_reg(source); + + // Determines if to get from the signed or unsigned candidates + static constexpr uint32_t immLut = (0xf0 & 0xcc) | 0xaa; + uint32_t sign; // ((reg & 0x88888888) | 0x64206420) >> 1 + asm volatile( + "{\n" + " lop3.b32 %0, %1, %2, %3, %4;\n" \ + "}\n" + : "=r"(sign) + : "r"(reg), "n"(0x88888888), "n"(0x64206420), "n"(immLut) + ); + sign = sign >> 1; + + // Ignore sign bit when indexing into LUT + uint32_t lut_idx = reg & 0x77777777; + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < N / 4; ++i, lut_idx >>=16, sign >>=16) { + Array const& _scale_neg = reinterpret_cast const&>(scale_neg[scale_idx + i * 4]); + Array const& _scale_pos = reinterpret_cast const&>(scale_pos[scale_idx + i * 4]); + asm volatile( + "{\n" + " .reg .b32 pos, neg ;\n" \ + " prmt .b32 neg, %3, %4, %1 ;\n" \ + " prmt .b32 pos, %5, %6, %1 ;\n" \ + " prmt .b32 %0, pos, neg, %2 ;\n" \ + "}\n" + : "=r"(res[i]) + : "r"(lut_idx), "r"(sign), "r"(_scale_neg[0]), "r"(_scale_neg[1]), "r"(_scale_pos[0]), "r"(_scale_pos[1]) + ); + } + return reinterpret_cast&>(res); + } + + template + CUTLASS_DEVICE + static void static_check_scale(Layout const& tensor) { + static_assert(shape<0>(Layout{}) >= 4 && stride<0>(Layout{}) == 0, "At least 4 adjacent weights in a thread must share the same scale."); + } + template + CUTLASS_DEVICE + static void static_check_scale(Tensor const& tensor) { + static_check_scale(flatten(Layout{})); + } /// Utilities to transform A. - template CUTLASS_DEVICE void transform_A_kblock( - TCrA_load const& tCrA_load, - cute::Int vec_A, - TCrA_mma& tCrA_mma, + Tensor const& tCrA_load, + Tensor& tCrA_mma, cute::tuple const& partitioned_extra_info, int const k_block) { + static_assert(is_rmem::value, "Input tensor for A conversion must come from registers"); + static_assert(is_rmem::value, "Output tensor for A conversion must come from registers"); + static_assert(cosize_v == cosize_v); + static_assert(size_v == cosize_v); + static_assert(size_v == cosize_v); + using SrcType = typename EngineIn::value_type; + using DstType = typename EngineOut::value_type; + + auto const& src = tCrA_load(_, _, k_block); + auto const& dst = tCrA_mma(_, _, k_block); + auto pSrc = raw_pointer_cast(src.data()); + auto pDst = const_cast(raw_pointer_cast(dst.data())); + constexpr int num_elements = decltype(size(src))::value; + if constexpr (KernelConversionMode == ConversionMode::DirectConvert) { - transform_internal_A(tCrA_load(_, _, k_block), vec_A, tCrA_mma(_, _, k_block)); + constexpr int pack = decltype(select_packing::value())::value; + using Converter = cutlass::NumericArrayConverter; + using SrcArray = cutlass::Array; + using DstArray = cutlass::Array; + constexpr int iters = num_elements / pack; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < iters; ++i) { + SrcArray const* pSrcArr = reinterpret_cast(pSrc) + i; + DstArray* pDstArr = reinterpret_cast(pDst) + i; + *pDstArr = Converter::convert(*pSrcArr); + } } + else if constexpr (UseScaleLookupTable) { + static_assert(is_same_v, "Lookup table only supports int4 being the quant type now."); + static_assert(sizeof_bits_v == 64, "Lookup table only supports 8 8bit scale values now."); + static_assert(num_elements % 4 == 0 && num_elements >= 4, "Lookup table requires a vector size of 4x when 
converting."); + constexpr int pack = num_elements % 8 == 0? 8 : 4; + constexpr int iters = num_elements / pack; + using SrcArray = cutlass::Array; + using DstArray = cutlass::Array; + + auto const& tCrS_neg = cute::get<1>(partitioned_extra_info); + auto const& tCrS_pos = cute::get<2>(partitioned_extra_info); + auto const& scale_neg = tCrS_neg(_, _, k_block); + auto const& scale_pos = tCrS_pos(_, _, k_block); + CUTE_STATIC_ASSERT_V(size(src) == size(scale_neg)); + + static_check_scale(scale_neg); + static_check_scale(scale_pos); + if (k_block == 0) { + auto pNeg = raw_pointer_cast(tCrS_neg.data()); + auto pPos = const_cast(raw_pointer_cast(tCrS_pos.data())); + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < cosize(tCrS_neg.layout()); ++i) + { + // pPos[i] = pNeg[i] & 0x7F7F7F7F7F7F7F00; + cutlass::Array const& _scale_neg = reinterpret_cast const&>(pNeg[i]); + cutlass::Array & _scale_pos = reinterpret_cast &>(pPos[i]); + asm volatile( + "{\n" + " and .b32 %0, %2, %4 ;\n" \ + " and .b32 %1, %3, %5 ;\n" \ + "}\n" + : "=r"(_scale_pos[0]), "=r"(_scale_pos[1]) + : "r"(_scale_neg[0]), "r"(_scale_neg[1]), "n"(0x7F7F7F00), "n"(0x7F7F7F7F) + ); + } + } + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < iters; i ++) { + SrcArray const* pSrcArr = reinterpret_cast(raw_pointer_cast(src.data())) + i; + DstArray* pDstArr = reinterpret_cast(raw_pointer_cast(dst.data())) + i; + + *pDstArr = lookup_table_convert(Int{}, *pSrcArr, scale_neg, scale_pos, i * pack); + } + } else if constexpr (KernelConversionMode == ConversionMode::ConvertAndScale) { - auto tCrS = cute::get<1>(partitioned_extra_info); - transform_internal_A(tCrA_load(_, _, k_block), vec_A, make_fragment_like(tCrA_mma)(_, _, k_block), tCrS(_, _, 0), tCrA_mma(_, _, k_block)); + auto const& scales = cute::get<1>(partitioned_extra_info)(_, _, k_block); + CUTE_STATIC_ASSERT_V(size(src) == size(scales)); + + if constexpr (is_same_v) { + constexpr int pack = decltype(select_packing::value())::value; + using Converter = cutlass::NumericArrayConverter; + using SrcArray = cutlass::Array; + using DstArray = cutlass::Array; + constexpr int iters = num_elements / pack; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < iters; ++i) { + SrcArray const* pSrcArr = reinterpret_cast(pSrc) + i; + DstArray* pDstArr = reinterpret_cast(pDst) + i; + *pDstArr = Converter::convert(*pSrcArr); + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < pack; ++j) { + (*pDstArr)[j] = (*pDstArr)[j] * scales[i*pack + j]; + } + } + } + else { + constexpr int pack1 = decltype(select_packing::value())::value; + constexpr int pack2 = decltype(select_packing::value())::value; + constexpr int pack = cute::gcd(pack1, pack2); + using Converter1 = cutlass::NumericArrayConverter; + using Converter2 = cutlass::NumericArrayConverter; + using SrcArray = cutlass::Array; + using DstArray = cutlass::Array; + using StageArray = cutlass::Array; + constexpr int iters = num_elements / pack; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < iters; ++i) { + SrcArray const* pSrcArr = reinterpret_cast(pSrc) + i; + DstArray* pDstArr = reinterpret_cast(pDst) + i; + StageArray stageArr; + stageArr = Converter1::convert(*pSrcArr); + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < pack; ++j) { + stageArr[j] = stageArr[j] * scales[i*pack + j]; + } + *pDstArr = Converter2::convert(stageArr); + } + } } else if constexpr (KernelConversionMode == ConversionMode::ConvertAndScaleWithZero) { - auto tCrS = cute::get<1>(partitioned_extra_info); - auto tCrZ = cute::get<3>(partitioned_extra_info); - transform_internal_A(tCrA_load(_, _, k_block), - 
vec_A, - make_fragment_like(tCrA_mma)(_, _, k_block), - tCrS(_, _, 0), - tCrZ(_, _, 0), - make_fragment_like(tCrZ)(_, _, 0), - tCrA_mma(_, _, k_block)); + static_assert(is_same_v, "ElementScale and ElementZero must be the same."); + auto const& scales = cute::get<1>(partitioned_extra_info)(_, _, k_block); + auto const& zeros = cute::get<3>(partitioned_extra_info)(_, _, k_block); + CUTE_STATIC_ASSERT_V(size(src) == size(scales)); + CUTE_STATIC_ASSERT_V(size(src) == size(zeros)); + + if constexpr (is_same_v) { + constexpr int pack = decltype(select_packing::value())::value; + using Converter = cutlass::NumericArrayConverter; + using SrcArray = cutlass::Array; + using DstArray = cutlass::Array; + constexpr int iters = num_elements / pack; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < iters; ++i) { + SrcArray const* pSrcArr = reinterpret_cast(pSrc) + i; + DstArray* pDstArr = reinterpret_cast(pDst) + i; + *pDstArr = Converter::convert(*pSrcArr); + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < pack; ++j) { + (*pDstArr)[j] = (*pDstArr)[j] * scales[i*pack + j] + zeros[i*pack + j]; + } + } + } + else { + constexpr int pack1 = decltype(select_packing::value())::value; + constexpr int pack2 = decltype(select_packing::value())::value; + constexpr int pack = cute::gcd(pack1, pack2); + using Converter1 = cutlass::NumericArrayConverter; + using Converter2 = cutlass::NumericArrayConverter; + using SrcArray = cutlass::Array; + using DstArray = cutlass::Array; + using StageArray = cutlass::Array; + constexpr int iters = num_elements / pack; + + CUTLASS_PRAGMA_UNROLL + for (int i = 0; i < iters; ++i) { + SrcArray const* pSrcArr = reinterpret_cast(pSrc) + i; + DstArray* pDstArr = reinterpret_cast(pDst) + i; + StageArray stageArr; + stageArr = Converter1::convert(*pSrcArr); + CUTLASS_PRAGMA_UNROLL + for (int j = 0; j < pack; ++j) { + stageArr[j] = stageArr[j] * scales[i*pack + j] + zeros[i*pack + j]; + } + *pDstArr = Converter2::convert(stageArr); + } + } + return; } else { static_assert(cutlass::detail::dependent_false, "No A data is loaded."); } } - - /// Utilities for transforming the A operand prior to issuing tensorcore math. - template > - CUTLASS_DEVICE void - convert_tensor( - Tensor const& in, - Tensor& out, - cute::Int width = {}) { - - /// This is an element-wise conversion where we expect both tensors to have the same layout. - /// As a result, we can cast as a cutlass array to use the fast numeric converters without - /// worrying about indexing into the layout. - constexpr int N = cosize_v; - - /// The inputs must be backed by registers & be statically sized. 
- static_assert(is_rmem::value, "Input tensor for A conversion must come from registers"); - static_assert(is_rmem::value, "Output tensor for A conversion must come from registers"); - static_assert(is_static_v, "Tensor layout for the conversion must be static"); - static_assert(cosize_v == size(TensorLayout{}), "Cosize and size of the layout must be equal."); - static_assert(N % ConversionVectorWidth == 0, "Conversion vector width must divide cosize of the tensor layout."); - - using SrcType = typename EngineIn::value_type; - using DstType = typename EngineOut::value_type; - - using SrcArray = cutlass::Array; - using DstArray = cutlass::Array; - - constexpr cutlass::FloatRoundStyle RoundStyle = cutlass::FloatRoundStyle::round_to_nearest; - using Converter = cutlass::NumericArrayConverter; - - constexpr int NumIterations = N / ConversionVectorWidth; - - for (int ii = 0; ii < NumIterations; ++ii) { - SrcArray const* src_array_ptr = reinterpret_cast(raw_pointer_cast(in.data())) + ii; - DstArray* dst_array_ptr = reinterpret_cast(raw_pointer_cast(out.data())) + ii; - *dst_array_ptr = Converter::convert(*src_array_ptr); - } - } - - template - CUTLASS_DEVICE void - transform_internal_A( - Tensor&& in, - cute::Int a_vec_width, - Tensor&& out) { - - convert_tensor(in, out, a_vec_width); - } - - template - CUTLASS_DEVICE void - transform_internal_A( - Tensor&& in, - cute::Int a_vec_width, - Tensor&& converted_inputs, - Tensor&& scales, - Tensor&& out) { - - static_assert(cute::is_same_v, - "Type of the engine input buffer must equal the scale buffer"); - - // First, we upcast the inputs to the scale type - convert_tensor(in, converted_inputs, a_vec_width); - - // Apply scales and broadcast across inputs, store in converted_inputs - cute::transform(converted_inputs, scales, converted_inputs, cute::multiplies{}); - - // Finally, we convert the scaled inputs to the mma type. - convert_tensor(converted_inputs, out); - } - - template - CUTLASS_DEVICE void - transform_internal_A( - Tensor&& in, - cute::Int a_vec_width, - Tensor&& converted_inputs, - Tensor&& scales, - Tensor&& zeros, - Tensor&& converted_zeros, - Tensor&& out) { - - static_assert(cute::is_same_v, - "Type of the engine input buffer must equal the scale buffer"); - - static_assert(cute::is_same_v, - "Type of the engine zero buffer must equal the scale buffer"); - - // First, we upcast the inputs to the scale type - convert_tensor(in, converted_inputs, a_vec_width); - convert_tensor(zeros, converted_zeros); - - // Apply scales and broadcast across inputs, store in converted_inputs - cute::transform(converted_inputs, scales, converted_inputs, cute::multiplies{}); - cute::transform(converted_inputs, converted_zeros, converted_inputs, cute::plus{}); - - // Finally, we convert the scaled inputs to the mma type. 
- convert_tensor(converted_inputs, out); - } }; ///////////////////////////////////////////////////////////////////////////////////////////////// diff --git a/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp b/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp index 261546f420..d39b59cee5 100644 --- a/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp +++ b/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized.hpp @@ -150,7 +150,7 @@ struct CollectiveMma< struct SharedStorage { - struct TensorStorage : cute::aligned_struct<128> { + struct TensorStorage : cute::aligned_struct<128, _0> { cute::array_aligned> smem_A; cute::array_aligned> smem_B; } tensors; diff --git a/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp b/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp index b72629d695..9cc6c9ad46 100644 --- a/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp +++ b/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss_warpspecialized_fp8.hpp @@ -144,7 +144,7 @@ struct CollectiveMma< struct SharedStorage { - struct TensorStorage : cute::aligned_struct<128> { + struct TensorStorage : cute::aligned_struct<128, _0> { cute::array_aligned> smem_A; cute::array_aligned> smem_B; } tensors; @@ -246,8 +246,8 @@ struct CollectiveMma< implementable = implementable && cutlass::detail::check_alignment(cute::make_shape(M,K,L), StrideA{}); constexpr int min_tma_aligned_elements_B = tma_alignment_bits / cutlass::sizeof_bits::value; implementable = implementable && cutlass::detail::check_alignment(cute::make_shape(N,K,L), StrideB{}); - /* MMA promotion interval should be a multiple of 4, since each mainloop iteration would issue 4 MMA instructions. */ - implementable = implementable && (args.mma_promotion_interval % 4 == 0); + /* MMA promotion interval should be a multiple of the number of MMA instructions issued by each mainloop iteration. */ + implementable = implementable && (args.mma_promotion_interval % (size<2>(TileShape{})() / TiledMma().template tile_size_mnk<2>()()) == 0); if (!implementable) { CUTLASS_TRACE_HOST(" CAN IMPLEMENT: Problem Size doesn't meet the minimum alignment requirements for TMA.\n"); diff --git a/include/cutlass/gemm/collective/sm90_sparse_mma_tma_gmma_ss_warpspecialized.hpp b/include/cutlass/gemm/collective/sm90_sparse_mma_tma_gmma_ss_warpspecialized.hpp new file mode 100644 index 0000000000..01e83bdf54 --- /dev/null +++ b/include/cutlass/gemm/collective/sm90_sparse_mma_tma_gmma_ss_warpspecialized.hpp @@ -0,0 +1,724 @@ +/*************************************************************************************************** + * Copyright (c) 2024 - 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved. + * SPDX-License-Identifier: BSD-3-Clause + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions are met: + * + * 1. Redistributions of source code must retain the above copyright notice, this + * list of conditions and the following disclaimer. + * + * 2. Redistributions in binary form must reproduce the above copyright notice, + * this list of conditions and the following disclaimer in the documentation + * and/or other materials provided with the distribution. + * + * 3. 
Neither the name of the copyright holder nor the names of its + * contributors may be used to endorse or promote products derived from + * this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" + * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE + * DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE + * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL + * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR + * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER + * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + * + **************************************************************************************************/ +#pragma once + +#include "cutlass/cutlass.h" +#include "cutlass/gemm/collective/builders/sm90_sparse_config.inl" +#include "cutlass/gemm/dispatch_policy.hpp" +#include "cutlass/numeric_types.h" +#include "cutlass/pipeline/pipeline.hpp" +#include "cutlass/trace.h" + +#include "cute/arch/cluster_sm90.hpp" +#include "cute/arch/copy_sm90.hpp" +#include "cute/algorithm/functional.hpp" +#include "cute/atom/mma_atom.hpp" +#include "cute/algorithm/gemm.hpp" +#include "cute/tensor_predicate.hpp" +#include "cute/numeric/arithmetic_tuple.hpp" + +///////////////////////////////////////////////////////////////////////////////////////////////// + +namespace cutlass::gemm::collective { +using namespace cute; + +///////////////////////////////////////////////////////////////////////////////////////////////// + +// WarpSpecialized Mainloop +template < + int Stages, + class ClusterShape, + class KernelSchedule, + class TileShape_, + class ElementA_, + class LayoutPairAE_, + class ElementB_, + class StrideB_, + class TiledMma_, + class GmemTiledCopyA_, + class SmemLayoutAtomA_, + class SmemCopyAtomA_, + class TransformA_, + class GmemTiledCopyB_, + class SmemLayoutAtomB_, + class SmemCopyAtomB_, + class TransformB_> +struct CollectiveMma< + MainloopSm90TmaGmmaWarpSpecializedSparse, + TileShape_, + ElementA_, + LayoutPairAE_, + ElementB_, + StrideB_, + TiledMma_, + GmemTiledCopyA_, + SmemLayoutAtomA_, + SmemCopyAtomA_, + TransformA_, + GmemTiledCopyB_, + SmemLayoutAtomB_, + SmemCopyAtomB_, + TransformB_> +{ + // + // Type Aliases + // + using DispatchPolicy = MainloopSm90TmaGmmaWarpSpecializedSparse; + using TileShape = TileShape_; + using TiledMma = TiledMma_; + using ElementA = ElementA_; + using ElementAMma = typename TiledMma::ValTypeA; + using ElementAMmaRaw = typename ElementAMma::raw_type; + using LayoutPairAE = LayoutPairAE_; + using LayoutA = remove_cvref_t(LayoutPairAE{}))>; + using LayoutE = remove_cvref_t(LayoutPairAE{}))>; + using StrideA = decltype(cute::stride(LayoutA{})); + using ElementB = ElementB_; + using ElementBMma = typename TiledMma::ValTypeB; + using StrideB = StrideB_; + using ElementEMma = typename TiledMma::ValTypeE; + using ElementE = typename ElementEMma::raw_type; + using ElementAccumulator = typename TiledMma::ValTypeC; + using GmemTiledCopyA = GmemTiledCopyA_; + using GmemTiledCopyB = GmemTiledCopyB_; + using SmemLayoutAtomA = SmemLayoutAtomA_; + using SmemLayoutAtomB = SmemLayoutAtomB_; + using 
+  // LayoutA is nested in the stride due to the sparsity.
+  static constexpr bool is_A_mn_major = cute::is_same_v<decltype(get<0,0>(LayoutA{}.stride())), Int<ElementAMmaSparsity>>;
+  static constexpr bool is_B_mn_major = cutlass::gemm::detail::is_major<0,StrideB>();
+
+  using SparseConfig = cutlass::Sm90GemmSparseConfig<ElementAMma,
+                                                     (is_A_mn_major ? GMMA::Major::MN : GMMA::Major::K),
+                                                     ElementEMma,
+                                                     decltype(cute::min(size<2>(TileShape{}),_128{}))>;
+
+  // The offline permutation for the metadata.
+  using SmemLayoutAtomE_ = typename SparseConfig::TensorEAtom;
+  using SmemLayoutAtomE  = ComposedLayout<Swizzle<0,4,3>,
+                                          smem_sparse_ptr_flag_bits<ElementEMmaSparsity, sizeof_bits_v<ElementE>>,
+                                          SmemLayoutAtomE_>;
+
+  // Metadata pathways
+  using SmemCopyAtomE = AutoVectorizingCopy;
+  using GmemCopyAtomE = GmemTiledCopyA;
+
+  using CtaShape_MNK = TileShape;
+  using MainloopPipeline = cutlass::PipelineTmaAsync<DispatchPolicy::Stages>;
+  using PipelineState = cutlass::PipelineState<DispatchPolicy::Stages>;
+
+  using PipelineParams = typename MainloopPipeline::Params;
+
+  static_assert(cute::rank(SmemLayoutAtomA{}) == 2, "SmemLayoutAtom must be rank 2 (M,K)");
+  static_assert((size<0>(TileShape{}) % size<0>(SmemLayoutAtomA{})) == 0, "SmemLayoutAtom must evenly divide tile shape.");
+  static_assert((size<2>(TileShape{}) % size<1>(SmemLayoutAtomA{})) == 0, "SmemLayoutAtom must evenly divide tile shape.");
+
+  static_assert(cute::rank(SmemLayoutAtomB{}) == 2, "SmemLayoutAtom must be rank 2 (N,K)");
+  static_assert((size<1>(TileShape{}) % size<0>(SmemLayoutAtomB{})) == 0, "SmemLayoutAtom must evenly divide tile shape.");
+  static_assert((size<2>(TileShape{}) % size<1>(SmemLayoutAtomB{})) == 0, "SmemLayoutAtom must evenly divide tile shape.");
+
+  // Tile along modes in a way that maximizes the TMA box size.
+  using SmemLayoutA = decltype(tile_to_shape(
+      SmemLayoutAtomA{},
+      make_shape(shape<0>(TileShape{}), shape<2>(TileShape{}), Int<DispatchPolicy::Stages>{}),
+      cute::conditional_t<is_A_mn_major, Step<_2,_1,_3>, Step<_1,_2,_3>>{}));
+  using SmemLayoutE = decltype(tile_to_shape(
+      SmemLayoutAtomE{},
+      make_shape(shape<0>(TileShape{}), shape<2>(TileShape{}), Int<DispatchPolicy::Stages>{})));
+  using SmemLayoutB = decltype(tile_to_shape(
+      SmemLayoutAtomB{},
+      make_shape(shape<1>(TileShape{}), shape<2>(TileShape{}), Int<DispatchPolicy::Stages>{}),
+      cute::conditional_t<is_B_mn_major, Step<_2,_1,_3>, Step<_1,_2,_3>>{}));
+
+  static_assert(DispatchPolicy::Stages >= 2, "Specialization requires Stages set to value 2 or more.");
+  static_assert(cute::is_base_of<cute::GMMA::DescriptorIterator, typename TiledMma::FrgTypeA>::value &&
+                cute::is_base_of<cute::GMMA::DescriptorIterator, typename TiledMma::FrgTypeB>::value,
+                "MMA atom must source both A and B operand from smem_desc for this mainloop.");
+  static_assert(cute::is_same_v<GmemTiledCopyA, SM90_TMA_LOAD> || cute::is_same_v<GmemTiledCopyA, SM90_TMA_LOAD_MULTICAST>,
+                "GmemTiledCopy - invalid SM90 TMA copy atom specified.");
+  static_assert(cute::is_same_v<GmemTiledCopyB, SM90_TMA_LOAD> || cute::is_same_v<GmemTiledCopyB, SM90_TMA_LOAD_MULTICAST>,
+                "GmemTiledCopy - invalid SM90 TMA copy atom specified.");
+
+  static_assert(cute::is_void_v<SmemCopyAtomA>,
+                "SM90 GMMA mainloops cannot have a non-void copy atom for smem sourced instructions.");
+  static_assert(cute::is_void_v<SmemCopyAtomB>,
+                "SM90 GMMA mainloops cannot have a non-void copy atom for smem sourced instructions.");
+
+  // TMA converts f32 input to tf32 when copying from GMEM to SMEM
+  // For all other types, cast to size equivalent uint type to avoid any rounding by TMA.
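+  // (Editorial illustration of the mapping defined just below, not original
+  // source: with ElementA = ElementB = float, the internal TMA types become
+  // sparse_elem<..., tfloat32_t> and tfloat32_t, so TMA performs the f32->tf32
+  // narrowing in flight; with ElementA = half_t, the internal type is the
+  // size-equivalent uint16_t and the bits pass through TMA unmodified.)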
+  using TmaInternalElementA = cute::sparse_elem<ElementAMmaSparsity,
+                                                cute::conditional_t<cute::is_same_v<ElementA, float>,
+                                                                    cutlass::tfloat32_t,
+                                                                    uint_bit_t<sizeof_bits_v<ElementA>>>>;
+  using TmaInternalElementB = cute::conditional_t<cute::is_same_v<ElementB, float>,
+                                                  tfloat32_t,
+                                                  uint_bit_t<sizeof_bits_v<ElementB>>>;
+
+  struct SharedStorage
+  {
+    struct TensorStorage {
+      alignas(128) cute::ArrayEngine<ElementAMma, cute::cosize_v<SmemLayoutA>> smem_A;
+      alignas(128) cute::ArrayEngine<ElementBMma, cute::cosize_v<SmemLayoutB>> smem_B;
+      alignas(128) cute::ArrayEngine<ElementEMma, cute::cosize_v<SmemLayoutE>> smem_E;
+    } tensors;
+
+    using PipelineStorage = typename MainloopPipeline::SharedStorage;
+    PipelineStorage pipeline;
+  };
+  using TensorStorage = typename SharedStorage::TensorStorage;
+  using PipelineStorage = typename SharedStorage::PipelineStorage;
+
+  static constexpr int K_PIPE_MAX = DispatchPolicy::Stages;
+  static constexpr int K_PIPE_MMAS = 0;
+
+  static constexpr uint32_t TmaTransactionBytes =
+      cutlass::bits_to_bytes(cosize(take<0,2>(SmemLayoutA{})) * cute::sizeof_bits_v<ElementAMma>) +
+      cutlass::bits_to_bytes(cosize(take<0,2>(SmemLayoutE{})) * cute::sizeof_bits_v<ElementEMma>) +
+      cutlass::bits_to_bytes(cosize(take<0,2>(SmemLayoutB{})) * cute::sizeof_bits_v<ElementBMma>);
+
+  // Host side kernel arguments
+  struct Arguments {
+    ElementA const* ptr_A{};
+    LayoutA layout_a{};
+    ElementB const* ptr_B{};
+    StrideB dB{};
+    ElementE const* ptr_E{};
+    LayoutE layout_e{};
+  };
+
+  // Device side kernel params
+  struct Params {
+
+    using TMA_A = decltype(make_tma_copy(
+        GmemTiledCopyA{},
+        make_tensor(recast_ptr<TmaInternalElementA>(nullptr), LayoutA{}),
+        SmemLayoutA{}(_,_,cute::Int<0>{}),
+        make_shape(shape<0>(TileShape{}), shape<2>(TileShape{})),
+        size<1>(ClusterShape{})));  // mcast along N mode for this M load, if any
+
+    using TMA_E = decltype(make_tma_copy( // use uint64_t to get the largest loading box.
+        GmemCopyAtomE{},
+        make_tensor(recast_ptr<sparse_elem<ElementEMmaSparsity, uint64_t>>(nullptr), LayoutE{}),
+        SmemLayoutE{}(_,_,cute::Int<0>{}),
+        make_shape(shape<0>(TileShape{}), shape<2>(TileShape{})),
+        size<1>(ClusterShape{})));  // mcast along N mode for this M load, if any
+
+    using TMA_B = decltype(make_tma_copy(
+        GmemTiledCopyB{},
+        make_tensor(static_cast<TmaInternalElementB const*>(nullptr), repeat_like(StrideB{}, int32_t(0)), StrideB{}),
+        SmemLayoutB{}(_,_,cute::Int<0>{}),
+        make_shape(shape<1>(TileShape{}), shape<2>(TileShape{})),
+        size<0>(ClusterShape{})));  // mcast along M mode for this N load, if any
+
+    TMA_A tma_load_a;
+    TMA_E tma_load_e;
+    TMA_B tma_load_b;
+    LayoutA layout_a;
+    LayoutE layout_e;
+    uint32_t tma_transaction_bytes = TmaTransactionBytes;
+  };
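+  // (Host-side sketch, editorial and hedged: layout_a/layout_e are expected
+  // to be produced by the same SparseConfig helpers that can_implement()
+  // validates against below, e.g.
+  //
+  //   auto layout_a = SparseConfig::fill_layoutA(problem_shape_MNKL);
+  //   auto layout_e = SparseConfig::fill_layoutE(problem_shape_MNKL);
+  //   Arguments args{ptr_A, layout_a, ptr_B, dB, ptr_E, layout_e};
+  //
+  // Variable names here are illustrative, not part of this header.)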
+  //
+  // Methods
+  //
+
+  template <class ProblemShape>
+  static constexpr Params
+  to_underlying_arguments(ProblemShape const& problem_shape, Arguments const& args, void* workspace) {
+    (void) workspace;
+
+    // Optionally append 1s until problem shape is rank-4 (MNKL), in case it is only rank-3 (MNK)
+    auto problem_shape_MNKL = append<4>(problem_shape, 1);
+    auto [M,N,K,L] = problem_shape_MNKL;
+
+    auto ptr_A = recast_ptr<TmaInternalElementA>(args.ptr_A);
+    auto ptr_B = recast_ptr<TmaInternalElementB>(args.ptr_B);
+    auto ptr_E = recast_ptr<sparse_elem<ElementEMmaSparsity, uint64_t>>(args.ptr_E);
+
+    Tensor tensor_a = make_tensor(ptr_A, args.layout_a);
+    Tensor tensor_b = make_tensor(ptr_B, make_layout(make_shape(N,K,L), args.dB));
+    Tensor tensor_e = make_tensor(ptr_E, args.layout_e);
+
+    typename Params::TMA_A tma_load_a = make_tma_copy(
+        GmemTiledCopyA{},
+        tensor_a,
+        SmemLayoutA{}(_,_,cute::Int<0>{}),
+        make_shape(shape<0>(TileShape{}), shape<2>(TileShape{})),
+        size<1>(ClusterShape{}));  // mcast along N mode for this M load, if any
+
+    typename Params::TMA_E tma_load_e = make_tma_copy( // use uint64_t to get the largest loading box.
+        GmemCopyAtomE{},
+        tensor_e,
+        SmemLayoutE{}(_,_,cute::Int<0>{}),
+        make_shape(shape<0>(TileShape{}), shape<2>(TileShape{})),
+        size<1>(ClusterShape{}));  // mcast along N mode for this M load, if any
+
+    typename Params::TMA_B tma_load_b = make_tma_copy(
+        GmemTiledCopyB{},
+        tensor_b,
+        SmemLayoutB{}(_,_,cute::Int<0>{}),
+        make_shape(shape<1>(TileShape{}), shape<2>(TileShape{})),
+        size<0>(ClusterShape{}));  // mcast along M mode for this N load, if any
+
+    return {
+      tma_load_a,
+      tma_load_e,
+      tma_load_b,
+      args.layout_a,
+      args.layout_e
+    };
+  }
+
+  template <class ProblemShape>
+  CUTLASS_HOST_DEVICE static bool
+  can_implement(
+      ProblemShape const& problem_shape,
+      [[maybe_unused]] Arguments const& args) {
+    constexpr int tma_alignment_bits = 128;
+    constexpr int min_tma_aligned_elements_A = tma_alignment_bits / cutlass::sizeof_bits<ElementA>::value;
+    constexpr int min_tma_aligned_elements_B = tma_alignment_bits / cutlass::sizeof_bits<ElementB>::value;
+    auto problem_shape_MNKL = append<4>(problem_shape, 1);
+    auto [M,N,K,L] = problem_shape_MNKL;
+
+    bool size_check = true;
+    // Check Alignment A
+    if constexpr (is_A_mn_major) {
+      size_check = size_check && cutlass::detail::check_alignment<min_tma_aligned_elements_A>(cute::make_shape(M,K/2,L), cute::make_stride(_1{}, M, M*K/2));
+    }
+    else { // If A is K-major
+      size_check = size_check && cutlass::detail::check_alignment<min_tma_aligned_elements_A>(cute::make_shape(M,K/2,L), cute::make_stride(K/2, _1{}, M*K/2));
+    }
+    size_check = size_check && cutlass::detail::check_alignment<min_tma_aligned_elements_B>(cute::make_shape(N,K,L), StrideB{});
+
+    if (!size_check) {
+      CUTLASS_TRACE_HOST("  CAN IMPLEMENT: Problem Size doesn't meet the minimum alignment requirements for TMA.\n");
+    }
+
+    // Check if layout_a and layout_e is filled correctly
+    auto layout_a_ref = SparseConfig::fill_layoutA(problem_shape_MNKL);
+    auto layout_e_ref = SparseConfig::fill_layoutE(problem_shape_MNKL);
+    bool layout_check = true;
+    layout_check = layout_check && (layout_a_ref == args.layout_a);
+    layout_check = layout_check && (layout_e_ref == args.layout_e);
+
+    if (!layout_check) {
+      CUTLASS_TRACE_HOST("  CAN IMPLEMENT: Layout_a/e mismatch.\n");
+    }
+
+    return size_check && layout_check;
+  }
+
+  /// Issue Tma Descriptor Prefetch -- ideally from a single thread for best performance
+  CUTLASS_DEVICE
+  static void prefetch_tma_descriptors(Params const& mainloop_params) {
+    cute::prefetch_tma_descriptor(mainloop_params.tma_load_a.get_tma_descriptor());
+    cute::prefetch_tma_descriptor(mainloop_params.tma_load_e.get_tma_descriptor());
+    cute::prefetch_tma_descriptor(mainloop_params.tma_load_b.get_tma_descriptor());
+  }
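+  // (Usage sketch, an editorial assumption about the kernel layer rather than
+  // code from this file: the descriptor prefetch is typically issued once per
+  // CTA by a single elected thread before the producer loop, e.g.
+  //
+  //   if (cute::elect_one_sync()) {
+  //     CollectiveMainloop::prefetch_tma_descriptors(params.mainloop);
+  //   }
+  // )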
+  /// Set up the data needed by this collective for load and mma.
+  /// Returns a tuple of tensors. The collective and the kernel layer have the contract
+  /// Returned tuple must contain at least two elements, with the first two elements being:
+  /// gA_mkl - The tma tensor, A after a local tile so it has shape  (BLK_M,BLK_K,m,k,l)
+  /// gB_nkl - The tma tensor, B after a local tile so it has shape  (BLK_N,BLK_K,n,k,l)
+  /// The rest of the tensors can be specified as needed by this collective.
+  template <class ProblemShape_MNKL>
+  CUTLASS_DEVICE auto
+  load_init(ProblemShape_MNKL const& problem_shape_MNKL, Params const& mainloop_params) const {
+    using X = Underscore;
+    // Separate out problem shape for convenience
+    auto [M,N,K,L] = problem_shape_MNKL;
+
+    // TMA requires special handling of strides to deal with coord codomain mapping
+    // Represent the full tensors -- get these from TMA
+    Tensor mA_mkl = mainloop_params.tma_load_a.get_tma_tensor(mainloop_params.layout_a.shape());  // (m,k,l)
+    Tensor mE_mkl = mainloop_params.tma_load_e.get_tma_tensor(mainloop_params.layout_e.shape());  // (m,k,l)
+    Tensor mB_nkl = mainloop_params.tma_load_b.get_tma_tensor(make_shape(N,K,L));                 // (n,k,l)
+
+    // Make tiled views, defer the slice
+    Tensor gA_mkl = local_tile(mA_mkl, TileShape{}, make_coord(_,_,_), Step<_1, X,_1>{});  // (BLK_M,BLK_K,m,k,l)
+    Tensor gE_mkl = local_tile(mE_mkl, TileShape{}, make_coord(_,_,_), Step<_1, X,_1>{});  // (BLK_M,BLK_K,m,k,l)
+    Tensor gB_nkl = local_tile(mB_nkl, TileShape{}, make_coord(_,_,_), Step< X,_1,_1>{});  // (BLK_N,BLK_K,n,k,l)
+
+    return cute::make_tuple(gA_mkl, gB_nkl, gE_mkl);
+  }
+  /// Perform a collective-scoped matrix multiply-accumulate
+  /// Producer Perspective
+  template <
+    class TensorA, class TensorB, class TensorE,
+    class KTileIterator, class BlockCoord
+  >
+  CUTLASS_DEVICE void
+  load(
+      Params const& mainloop_params,
+      MainloopPipeline pipeline,
+      PipelineState smem_pipe_write,
+      cute::tuple<TensorA, TensorB, TensorE> const& load_inputs,
+      BlockCoord const& blk_coord,
+      KTileIterator k_tile_iter, int k_tile_count,
+      int thread_idx,
+      uint32_t block_rank_in_cluster,
+      TensorStorage& shared_tensors) {
+    int lane_predicate = cute::elect_one_sync();
+
+    if (lane_predicate) {
+      Tensor sA = make_tensor(make_smem_ptr(shared_tensors.smem_A.begin()), SmemLayoutA{});  // (BLK_M,BLK_K,PIPE)
+      Tensor sE = make_tensor(make_smem_ptr(shared_tensors.smem_E.begin()), SmemLayoutE{});  // (BLK_M,BLK_K,PIPE)
+      Tensor sB = make_tensor(make_smem_ptr(shared_tensors.smem_B.begin()), SmemLayoutB{});  // (BLK_N,BLK_K,PIPE)
+
+      auto [gA_mkl, gB_nkl, gE_mkl] = load_inputs;
+
+      // Define the CTA-in-cluster Layout and Coord
+      Layout cta_layout_mnk = make_layout(ClusterShape{});
+      auto cta_coord_mnk = cta_layout_mnk.get_flat_coord(block_rank_in_cluster);
+
+      // TMA Multicast Masks
+      uint16_t mcast_mask_a = create_tma_multicast_mask<1>(cta_layout_mnk, cta_coord_mnk);
+      uint16_t mcast_mask_e = create_tma_multicast_mask<1>(cta_layout_mnk, cta_coord_mnk);
+      uint16_t mcast_mask_b = create_tma_multicast_mask<0>(cta_layout_mnk, cta_coord_mnk);
+
+      auto block_tma_a = mainloop_params.tma_load_a.get_slice(get<1>(cta_coord_mnk));
+      auto block_tma_e = mainloop_params.tma_load_e.get_slice(get<1>(cta_coord_mnk));
+      auto block_tma_b = mainloop_params.tma_load_b.get_slice(get<0>(cta_coord_mnk));
+
+      // Partition the inputs based on the current block coordinates.
+      auto [m_coord, n_coord, k_coord, l_coord] = blk_coord;
+      Tensor gA = gA_mkl(_,_,m_coord,_,l_coord);  // (BLK_M,BLK_K,k)
+      Tensor gE = gE_mkl(_,_,m_coord,_,l_coord);  // (BLK_M,BLK_K,k)
+      Tensor gB = gB_nkl(_,_,n_coord,_,l_coord);  // (BLK_N,BLK_K,k)
+
+      // Applies the mapping from block_tma_a
+      Tensor tAgA = block_tma_a.partition_S(gA);  // (TMA,TMA_M,TMA_K,k)
+      Tensor tAsA = block_tma_a.partition_D(sA);  // (TMA,TMA_M,TMA_K,PIPE)
+
+      Tensor tEgE = block_tma_e.partition_S(gE);  // (TMA,TMA_M,TMA_K,k)
+      Tensor tEsE = block_tma_e.partition_D(sE);  // (TMA,TMA_M,TMA_K,PIPE)
+
+      Tensor tBgB = block_tma_b.partition_S(gB);  // (TMA,TMA_N,TMA_K,k)
+      Tensor tBsB = block_tma_b.partition_D(sB);  // (TMA,TMA_N,TMA_K,PIPE)
+
+      // Mainloop
+      CUTLASS_PRAGMA_NO_UNROLL
+      for ( ; k_tile_count > 0; --k_tile_count)
+      {
+        // LOCK smem_pipe_write for _writing_
+        pipeline.producer_acquire(smem_pipe_write);
+
+        //
+        // Copy gmem to smem for *k_tile_iter
+        //
+
+        using BarrierType = typename MainloopPipeline::ProducerBarrierType;
+        BarrierType* tma_barrier = pipeline.producer_get_barrier(smem_pipe_write);
+
+        int write_stage = smem_pipe_write.index();
+        copy(mainloop_params.tma_load_a.with(*tma_barrier, mcast_mask_a), tAgA(_,_,_,*k_tile_iter), tAsA(_,_,_,write_stage));
+        copy(mainloop_params.tma_load_e.with(*tma_barrier, mcast_mask_e), tEgE(_,_,_,*k_tile_iter), tEsE(_,_,_,write_stage));
+        copy(mainloop_params.tma_load_b.with(*tma_barrier, mcast_mask_b), tBgB(_,_,_,*k_tile_iter), tBsB(_,_,_,write_stage));
+        ++k_tile_iter;
+
+        // Advance smem_pipe_write
+        ++smem_pipe_write;
+      }
+    }
+  }
+
+  /// Perform a Producer Epilogue to prevent early exit of blocks in a Cluster
+  CUTLASS_DEVICE void
+  load_tail(MainloopPipeline pipeline, PipelineState smem_pipe_write) {
+    int lane_predicate = cute::elect_one_sync();
+
+    // Issue the epilogue waits
+    if (lane_predicate) {
+      /* This helps avoid early exit of blocks in Cluster
+       * Waits for all stages to either be released (all
+       * Consumer UNLOCKs), or if the stage was never used
+       * then would just be acquired since the phase was
+       * still inverted from make_producer_start_state
+       */
+      pipeline.producer_tail(smem_pipe_write);
+    }
+  }
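+  // (Editorial sketch of the handshake assumed above, following the
+  // conventions of cutlass/pipeline/pipeline.hpp: the kernel layer constructs
+  // the producer state with an inverted start phase,
+  //
+  //   PipelineState smem_pipe_write = cutlass::make_producer_start_state<MainloopPipeline>();
+  //   PipelineState smem_pipe_read;  // consumer starts from the default phase
+  //
+  // which is exactly what the "still inverted from make_producer_start_state"
+  // comment in load_tail() relies on.)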
+  /// Perform a collective-scoped matrix multiply-accumulate
+  /// Consumer Perspective
+  template <
+    class FrgTensorC
+  >
+  CUTLASS_DEVICE void
+  mma(MainloopPipeline pipeline,
+      PipelineState smem_pipe_read,
+      FrgTensorC& accum,
+      int k_tile_count,
+      int thread_idx,
+      TensorStorage& shared_tensors,
+      Params const& mainloop_params) {
+    static_assert(is_rmem<FrgTensorC>::value, "C tensor must be rmem resident.");
+    static_assert(cute::rank(SmemLayoutA{}) == 3, "Smem layout must be rank 3.");
+    static_assert(cute::rank(SmemLayoutE{}) == 3, "Smem layout must be rank 3.");
+    static_assert(cute::rank(SmemLayoutB{}) == 3, "Smem layout must be rank 3.");
+
+    Tensor sA = make_tensor(make_smem_ptr(shared_tensors.smem_A.begin()), SmemLayoutA{});  // (BLK_M,BLK_K,PIPE)
+    Tensor sB = make_tensor(make_smem_ptr(shared_tensors.smem_B.begin()), SmemLayoutB{});  // (BLK_N,BLK_K,PIPE)
+
+    Tensor sE_ = make_tensor(make_smem_ptr(shared_tensors.smem_E.begin()), SmemLayoutE{}); // (BLK_M,BLK_K,PIPE)
+    Tensor sE  = as_position_independent_swizzle_tensor(sE_);
+
+    //
+    // Define C accumulators and A/B partitioning
+    //
+
+    TiledMma tiled_mma;
+    auto thread_mma = tiled_mma.get_thread_slice(thread_idx);
+
+    Tensor tCsA = thread_mma.partition_A(sA);  // (MMA,MMA_M,MMA_K,PIPE)
+    Tensor tCsB = thread_mma.partition_B(sB);  // (MMA,MMA_N,MMA_K,PIPE)
+
+    // Allocate "fragments/descriptors"
+    Tensor tCrA = thread_mma.make_fragment_A(tCsA);  // (MMA,MMA_M,MMA_K,PIPE)
+    Tensor tCrB = thread_mma.make_fragment_B(tCsB);  // (MMA,MMA_N,MMA_K,PIPE)
+
+    CUTE_STATIC_ASSERT_V(size<1>(tCsA) == size<1>(accum));  // M
+    CUTE_STATIC_ASSERT_V(size<1>(tCsB) == size<2>(accum));  // N
+    CUTE_STATIC_ASSERT_V(size<2>(tCsA) == size<2>(tCsB));   // K
+    CUTE_STATIC_ASSERT_V(size<3>(tCsA) == size<3>(tCsB));   // PIPE
+    CUTE_STATIC_ASSERT_V(Int<DispatchPolicy::Stages>{} == size<2>(sA));  // PIPE
+    CUTE_STATIC_ASSERT_V(Int<DispatchPolicy::Stages>{} == size<2>(sB));  // PIPE
+
+    auto copy_atom_E = Copy_Atom<SmemCopyAtomE, ElementEMma>{};
+
+    Tensor tCsE = partition_E(thread_mma, sE(_,_,Int<0>{}));  // (MMA,MMA_M,MMA_K)
+    Tensor tCrE = make_fragment_like(tCsE);                   // (MMA,MMA_M,MMA_K)
+
+    auto smem_tiled_copy_E = make_tiled_copy_E(copy_atom_E, tiled_mma);
+    auto smem_thr_copy_E   = smem_tiled_copy_E.get_thread_slice(thread_idx);
+
+    Tensor tEsE = smem_thr_copy_E.partition_S(sE);  // (ECPY,ECPY_M,ECPY_K)
+    Tensor tErE = smem_thr_copy_E.retile_D(tCrE);   // (ECPY,ECPY_M,ECPY_K)
+
+    //
+    // PIPELINED MAIN LOOP
+    //
+    static_assert((0 <= K_PIPE_MMAS) && (K_PIPE_MMAS < K_PIPE_MAX),
+                  "ERROR : Incorrect number of MMAs in flight");
+
+    // We release buffers to producer warps(dma load) with some mmas in flight
+    PipelineState smem_pipe_release = smem_pipe_read;
+
+    // Prologue GMMAs
+    int prologue_mma_count = min(K_PIPE_MMAS, k_tile_count);
+
+    tiled_mma.accumulate_ = GMMA::ScaleOut::Zero;
+
+    warpgroup_fence_operand(accum);
+    CUTLASS_PRAGMA_UNROLL
+    for (int k_tile_prologue = prologue_mma_count; k_tile_prologue > 0; --k_tile_prologue)
+    {
+      // WAIT on smem_pipe_read until its data are available (phase bit flips from rdPhaseBit value)
+      auto barrier_token = pipeline.consumer_try_wait(smem_pipe_read);
+      pipeline.consumer_wait(smem_pipe_read, barrier_token);
+      int read_stage = smem_pipe_read.index();
+
+      // Load metadata smem->rmem for one stage
+      copy(smem_tiled_copy_E, tEsE(_,_,_,read_stage), tErE);
+
+      warpgroup_arrive();
+      // Unroll the K mode manually to set scale D to 1
+      CUTLASS_PRAGMA_UNROLL
+      for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
+        cute::gemm(tiled_mma, make_zip_tensor(tCrA(_,_,k_block,read_stage), tErE(_,_,k_block)), tCrB(_,_,k_block,read_stage), accum);
+        tiled_mma.accumulate_ = GMMA::ScaleOut::One;
+      }
+
+      warpgroup_commit_batch();
+
+      ++smem_pipe_read;
+    }
+
+    warpgroup_fence_operand(accum);
+    // Mainloop GMMAs
+    k_tile_count -= prologue_mma_count;
+
+    CUTLASS_PRAGMA_NO_UNROLL
+    for ( ; k_tile_count > 0; --k_tile_count)
+    {
+      // WAIT on smem_pipe_read until its data are available (phase bit flips from rdPhaseBit value)
+      auto barrier_token = pipeline.consumer_try_wait(smem_pipe_read);
+      pipeline.consumer_wait(smem_pipe_read, barrier_token);
+      int read_stage = smem_pipe_read.index();
+
+      // Load metadata smem->rmem for one stage
+      copy(smem_tiled_copy_E, tEsE(_,_,_,read_stage), tErE);
+
+      warpgroup_fence_operand(accum);
+      warpgroup_arrive();
+      // Unroll the K mode manually to set scale D to 1
+      CUTLASS_PRAGMA_UNROLL
+      for (int k_block = 0; k_block < size<2>(tCrA); ++k_block) {
+        cute::gemm(tiled_mma, make_zip_tensor(tCrA(_,_,k_block,read_stage), tErE(_,_,k_block)), tCrB(_,_,k_block,read_stage), accum);
+        tiled_mma.accumulate_ = GMMA::ScaleOut::One;
+      }
+      warpgroup_commit_batch();
+
+      /// Wait on the GMMA barrier for K_PIPE_MMAS (or fewer) outstanding to ensure smem_pipe_write is consumed
+      warpgroup_wait<K_PIPE_MMAS>();
+      warpgroup_fence_operand(accum);
+
+      // UNLOCK smem_pipe_release, done _computing_ on it
+      pipeline.consumer_release(smem_pipe_release);
+
+      // Advance smem_pipe_read and smem_pipe_release
+      ++smem_pipe_read;
+      ++smem_pipe_release;
+    }
+
+    warpgroup_fence_operand(accum);
+  }
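+  // (Editorial note: make_zip_tensor pairs each A-operand fragment with its
+  // metadata fragment, so the sparse GMMA consumes (value, metadata) together
+  // per K block; this is the sparse analogue of passing tCrA alone in the
+  // dense SM90 mainloop.)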
+  /// Perform a Consumer Epilogue to release all buffers
+  CUTLASS_DEVICE void
+  mma_tail(MainloopPipeline pipeline, PipelineState smem_pipe_release, int k_tile_count) {
+    // Prologue GMMAs
+    int prologue_mma_count = min(K_PIPE_MMAS, k_tile_count);
+    k_tile_count -= prologue_mma_count;
+
+    smem_pipe_release.advance(k_tile_count);
+
+    // Wait on all GMMAs to complete
+    warpgroup_wait<0>();
+
+    for (int count = 0; count < prologue_mma_count; ++count) {
+      pipeline.consumer_release(smem_pipe_release);  // UNLOCK smem_pipe_release, done _computing_ on it
+      ++smem_pipe_release;
+    }
+  }
+
+private:
+
+  template <class... Args, class ETensor>
+  CUTE_HOST_DEVICE static constexpr
+  auto
+  thrfrg_E(TiledMMA<Args...> const& mma, ETensor&& etensor)
+  {
+    using TiledMma = TiledMMA<Args...>;
+
+    CUTE_STATIC_ASSERT_V(rank(etensor) >= Int<2>{});
+
+    // Reorder the tensor for the TiledAtom
+    auto t_tile = make_tile(get<0>(PermutationMNK{}),
+                            get<2>(PermutationMNK{}));
+    auto t_tensor = logical_divide(etensor, t_tile);  // (PermM,PermK)
+
+    // Tile the tensor for the Atom
+    auto e_tile = make_tile(make_layout(size<0>(typename TiledMma::AtomShape_MNK{})),
+                            make_layout(size<2>(typename TiledMma::AtomShape_MNK{})));
+    auto e_tensor = zipped_divide(t_tensor, e_tile);  // ((AtomM,AtomK),(RestM,RestK))
+
+    // Transform the Atom mode from (M,K) to (Thr,Val)
+    using AtomLayoutE_TV = typename TiledMma::Atom::Traits::ELayout;
+    auto tv_tensor = e_tensor.compose(AtomLayoutE_TV{},_);  // ((ThrV,FrgV),(RestM,RestK))
+
+    // Tile the tensor for the Thread
+    auto thr_tile = make_tile(_,
+                              make_tile(make_layout(size<1>(mma.thr_layout_vmnk_)),
+                                        make_layout(size<3>(mma.thr_layout_vmnk_))));
+    auto thr_tensor = zipped_divide(tv_tensor, thr_tile);  // ((ThrV,(ThrM,ThrK)),(FrgV,(RestM,RestK)))
+
+    return thr_tensor;
+  }
+
+  template <class... Args>
+  CUTE_HOST_DEVICE static constexpr
+  auto
+  get_layoutE_TV(TiledMMA<Args...> const& mma)
+  {
+    // (M,K) -> (M,K)
+    auto ref_E = make_layout(make_shape(tile_size<0>(mma), tile_size<2>(mma)));
+    // (ethrid,val) -> (M,K)
+    auto layoutE_TV = thrfrg_E(mma, ref_E);
+
+    // (ThrV,(ThrM,ThrK)) -> (ThrV,(ThrM,ThrN,ThrK))
+    auto etile = make_tile(_,
+                           make_tile(make_layout(make_shape (size<1>(mma.thr_layout_vmnk_), size<2>(mma.thr_layout_vmnk_)),
+                                                 make_stride( Int<1>{} , Int<0>{} )),
+                                     _));
+
+    // thr_idx -> (ThrV,ThrM,ThrN,ThrK)
+    auto thridx_2_thrid = right_inverse(mma.thr_layout_vmnk_);
+
+    // (thr_idx,val) -> (M,K)
+    return layoutE_TV.compose(etile, _).compose(thridx_2_thrid, _);
+  }
+
+  template <class... Args, class ETensor>
+  CUTE_HOST_DEVICE static constexpr
+  auto
+  partition_E(ThrMMA<Args...> const& thr_mma, ETensor&& etensor)
+  {
+    auto thr_tensor = make_tensor(static_cast<ETensor&&>(etensor).data(), thrfrg_E(thr_mma, etensor.layout()));
+
+    auto thr_vmk = make_coord(get<0>(thr_mma.thr_vmnk_), make_coord(get<1>(thr_mma.thr_vmnk_), get<3>(thr_mma.thr_vmnk_)));
+    return thr_tensor(thr_vmk, make_coord(_, repeat<rank<1>(thr_tensor)>(_)));
+  }
+
+  template <class... CArgs, class... MArgs>
+  CUTE_HOST_DEVICE static constexpr
+  auto
+  make_tiled_copy_E(Copy_Atom<CArgs...> const& copy_atom,
+                    TiledMMA<MArgs...> const& mma)
+  {
+    return make_tiled_copy_impl(copy_atom, get_layoutE_TV(mma), make_shape(tile_size<0>(mma),tile_size<2>(mma)));
+  }
+
+};
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
+
+} // namespace cutlass::gemm::collective
+
+/////////////////////////////////////////////////////////////////////////////////////////////////
diff --git a/include/cutlass/gemm/device/base_grouped.h b/include/cutlass/gemm/device/base_grouped.h
index 51b9d3dc10..eec61981f8 100644
--- a/include/cutlass/gemm/device/base_grouped.h
+++ b/include/cutlass/gemm/device/base_grouped.h
@@ -432,6 +432,7 @@ class BaseGrouped {
     //
     // Launch
 
+    cutlass::arch::synclog_setup();
     cutlass::Kernel<BaseKernel><<<grid, block, smem_size, stream>>>(params_);
 
     //
diff --git a/include/cutlass/gemm/device/default_gemm_configuration.h b/include/cutlass/gemm/device/default_gemm_configuration.h
index 4197a6b080..e7ed2da940 100644
--- a/include/cutlass/gemm/device/default_gemm_configuration.h
+++ b/include/cutlass/gemm/device/default_gemm_configuration.h
@@ -764,6 +764,58 @@ struct DefaultGemmConfiguration<
 
 ////////////////////////////////////////////////////////////////////////////////
 
+template <
+  typename ElementC>
+struct DefaultGemmConfiguration<
+  arch::OpClassTensorOp,
+  arch::Sm80,
+  int4b_t,
+  int8_t,
+  ElementC,
+  int32_t> {
+
+  static int const kAlignmentA = 128 / sizeof_bits<int4b_t>::value;
+  static int const kAlignmentB = 128 / sizeof_bits<int8_t>::value;
+
+  using ThreadblockShape = GemmShape<128, 256, 64>;
+  using WarpShape = GemmShape<64, 64, 64>;
+  using InstructionShape = GemmShape<16, 8, 32>;
+  static int const kStages = 3;
+
+  using EpilogueOutputOp = epilogue::thread::LinearCombinationClamp<
+      ElementC, 128 / sizeof_bits<ElementC>::value, int32_t, float>;
+
+  using Operator = arch::OpMultiplyAddSaturate;
+};
+
+////////////////////////////////////////////////////////////////////////////////
+
+template <
+  typename ElementC>
+struct DefaultGemmConfiguration<
+  arch::OpClassTensorOp,
+  arch::Sm80,
+  int8_t,
+  int4b_t,
+  ElementC,
+  int32_t> {
+
+  static int const kAlignmentA = 128 / sizeof_bits<int8_t>::value;
+  static int const kAlignmentB = 128 / sizeof_bits<int4b_t>::value;
+
+  using ThreadblockShape = GemmShape<128, 256, 64>;
+  using WarpShape = GemmShape<64, 64, 64>;
+  using InstructionShape = GemmShape<16, 8, 32>;
+  static int const kStages = 3;
+
+  using EpilogueOutputOp = epilogue::thread::LinearCombinationClamp<
+      ElementC, 128 / sizeof_bits<ElementC>::value, int32_t, float>;
+
+  using Operator = arch::OpMultiplyAddSaturate;
+};
+
+////////////////////////////////////////////////////////////////////////////////
+
 /// Base configuration for all {fe4m3, fe5m2} x {fe4m3, fe5m2} combinations on SM89
 template <
   typename ElementA,
diff --git a/include/cutlass/gemm/device/ell_gemm.h b/include/cutlass/gemm/device/ell_gemm.h
index f5b65cea29..54ddab4007 100644
--- a/include/cutlass/gemm/device/ell_gemm.h
+++ b/include/cutlass/gemm/device/ell_gemm.h
@@ -517,6 +517,7 @@ class EllGemm {
       }
     }
 
+    cutlass::arch::synclog_setup();
     cutlass::Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params_);
 
     result = cudaGetLastError();
diff --git a/include/cutlass/gemm/device/gemm.h b/include/cutlass/gemm/device/gemm.h
index 83ef21cb83..7a8ac552eb 100644
--- a/include/cutlass/gemm/device/gemm.h
+++ b/include/cutlass/gemm/device/gemm.h
@@ -497,6 +497,7 @@ class Gemm {
       syclcompat::launch<cutlass::Kernel<GemmKernel>>(sycl_grid, sycl_block, smem_size, params_);
 #else
+      cutlass::arch::synclog_setup();
       cutlass::Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params_);
 #endif
diff --git a/include/cutlass/gemm/device/gemm_array.h b/include/cutlass/gemm/device/gemm_array.h
index 6bbd90c1cd..1ae2db467f 100644
--- a/include/cutlass/gemm/device/gemm_array.h
+++ b/include/cutlass/gemm/device/gemm_array.h
@@ -446,6 +446,7 @@ class GemmArray {
       }
     }
 
+    cutlass::arch::synclog_setup();
    cutlass::Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params_);
 
     result = cudaGetLastError();
diff --git a/include/cutlass/gemm/device/gemm_batched.h b/include/cutlass/gemm/device/gemm_batched.h
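[Editorial note, not part of the patch: the synclog_setup() calls inserted
across these launch paths are host-side setup for the synclog debugging tool.
A hedged usage sketch follows; the build-flag name is assumed from the synclog
documentation referenced in the changelog, not verified here:

    // Assumed build option: cmake .. -DCUTLASS_ENABLE_SYNCLOG=ON
    cutlass::arch::synclog_setup();  // before each kernel launch, as in these diffs
    cutlass::Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params_);
    // When synclog is disabled, synclog_setup() should compile to a no-op;
    // see media/docs/utilities.md for how recorded events are dumped.
]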
index 3be34c808d..5981457c73 100644
--- a/include/cutlass/gemm/device/gemm_batched.h
+++ b/include/cutlass/gemm/device/gemm_batched.h
@@ -424,6 +424,7 @@ class GemmBatched {
       }
     }
 
+    cutlass::arch::synclog_setup();
     cutlass::Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params_);
 
     result = cudaGetLastError();
diff --git a/include/cutlass/gemm/device/gemm_complex.h b/include/cutlass/gemm/device/gemm_complex.h
index 36f57d6469..e36c69cefb 100644
--- a/include/cutlass/gemm/device/gemm_complex.h
+++ b/include/cutlass/gemm/device/gemm_complex.h
@@ -445,6 +445,7 @@ class GemmComplex {
       }
     }
 
+    cutlass::arch::synclog_setup();
     cutlass::Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params_);
 
     result = cudaGetLastError();
diff --git a/include/cutlass/gemm/device/gemm_sparse.h b/include/cutlass/gemm/device/gemm_sparse.h
index 1b1d27bda5..ac453c63b5 100644
--- a/include/cutlass/gemm/device/gemm_sparse.h
+++ b/include/cutlass/gemm/device/gemm_sparse.h
@@ -479,6 +479,7 @@ class SparseGemm {
 
     int smem_size = int(sizeof(typename GemmKernel::SharedStorage));
 
+    cutlass::arch::synclog_setup();
     cutlass::Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params_);
 
     cudaError_t result = cudaGetLastError();
diff --git a/include/cutlass/gemm/device/gemm_sparse_with_absmax.h b/include/cutlass/gemm/device/gemm_sparse_with_absmax.h
index e6db107604..e599217a13 100644
--- a/include/cutlass/gemm/device/gemm_sparse_with_absmax.h
+++ b/include/cutlass/gemm/device/gemm_sparse_with_absmax.h
@@ -324,6 +324,7 @@ class SparseGemmWithAbsmax {
 
     int smem_size = int(sizeof(typename GemmKernel::SharedStorage));
 
+    cutlass::arch::synclog_setup();
     cutlass::Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params_);
 
     cudaError_t result = cudaGetLastError();
diff --git a/include/cutlass/gemm/device/gemm_splitk_parallel.h b/include/cutlass/gemm/device/gemm_splitk_parallel.h
index 2c9408df0e..f78c5a2169 100644
--- a/include/cutlass/gemm/device/gemm_splitk_parallel.h
+++ b/include/cutlass/gemm/device/gemm_splitk_parallel.h
@@ -357,6 +357,7 @@ class GemmSplitKParallel {
       }
     }
 
+    cutlass::arch::synclog_setup();
     Kernel<GemmKernel><<<grid, block, smem_size, stream>>>(gemm_params_);
 
     result = cudaGetLastError();
diff --git a/include/cutlass/gemm/device/gemm_universal_adapter.h b/include/cutlass/gemm/device/gemm_universal_adapter.h
index 40a21b1078..8c9d37e573 100644
--- a/include/cutlass/gemm/device/gemm_universal_adapter.h
+++ b/include/cutlass/gemm/device/gemm_universal_adapter.h
@@ -44,6 +44,7 @@
 #include "cutlass/detail/mma.hpp"
 #include "cutlass/cuda_host_adapter.hpp"
 
+#include "cutlass/kernel_launch.h"
 #if !defined(__CUDACC_RTC__)
 #include "cutlass/cluster_launch.hpp"
 #include "cutlass/trace.h"
@@ -215,9 +216,10 @@ class GemmUniversalAdapter<
       workspace_bytes += sizeof(int) * size_t(cute::size<0>(TileShape{})) * size_t(cute::size<1>(TileShape{}));
     }
 
+    workspace_bytes += GemmKernel::get_workspace_size(args);
+
     CUTLASS_TRACE_HOST("  workspace_bytes: " << workspace_bytes);
 
-    workspace_bytes += GemmKernel::get_workspace_size(args);
     return workspace_bytes;
   }
@@ -356,9 +358,13 @@
     Status launch_result{ Status::kSuccess };
     // Use extended launch API only for mainloops that use it
     if constexpr (GemmKernel::ArchTag::kMinComputeCapability >= 90) {
+#if (CUTLASS_DEBUG_TRACE_LEVEL > 1)
+      CUTLASS_TRACE_HOST("GemmUniversal::run: Use extended launch API");
+#endif
 #if !defined(CUTLASS_ENABLE_SYCL)
-      constexpr bool is_static_1x1x1 = cute::is_static_v<typename GemmKernel::DispatchPolicy::ClusterShape> and
-        cute::size(typename GemmKernel::DispatchPolicy::ClusterShape{}) == 1;
+      [[maybe_unused]] constexpr bool is_static_1x1x1 =
+        cute::is_static_v<typename GemmKernel::DispatchPolicy::ClusterShape> and
+        cute::size(typename GemmKernel::DispatchPolicy::ClusterShape{}) == 1;
       dim3 cluster(cute::size<0>(typename GemmKernel::DispatchPolicy::ClusterShape{}),
                    cute::size<1>(typename GemmKernel::DispatchPolicy::ClusterShape{}),
                    cute::size<2>(typename GemmKernel::DispatchPolicy::ClusterShape{}));
@@ -370,12 +376,14 @@
       //
       CUTLASS_ASSERT(cuda_adapter);
       if (cuda_adapter) {
-
         if (launch_with_pdl) {
           CUTLASS_TRACE_HOST(
             "GemmUniversal::run() does not support launching with PDL and a custom cuda adapter.");
           return Status::kErrorInternal;
         }
+#if (CUTLASS_DEBUG_TRACE_LEVEL > 1)
+        CUTLASS_TRACE_HOST("GemmUniversal::run: Launching kernel with CUDA host adapter");
+#endif
         launch_result = cuda_adapter->launch(grid,
                                              cluster,
                                              block,
@@ -385,6 +393,7 @@ class GemmUniversalAdapter<
                                              0);
       }
       else {
+        CUTLASS_TRACE_HOST("GemmUniversal::run: kEnableCudaHostAdapter is true, but CUDA host adapter is null");
        return Status::kErrorInternal;
       }
     }
@@ -392,10 +401,25 @@
       CUTLASS_ASSERT(cuda_adapter == nullptr);
       void const* kernel = (void const*) device_kernel<GemmKernel>;
       if constexpr (GemmKernel::ArchTag::kMinComputeCapability == 90) {
-        if (is_static_1x1x1 && not launch_with_pdl) {
-          device_kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params);
+        if constexpr (is_static_1x1x1) {
+#if (CUTLASS_DEBUG_TRACE_LEVEL > 1)
+          CUTLASS_TRACE_HOST("GemmUniversal::run: Launching static 1x1x1 kernel");
+#endif
+          launch_result = cutlass::kernel_launch<GemmKernel>(
+            grid, block, smem_size, stream, params, launch_with_pdl);
+          if (launch_result != Status::kSuccess) {
+            CUTLASS_TRACE_HOST("GemmUniversal::run: cutlass::kernel_launch reports failure");
+          }
+#if (CUTLASS_DEBUG_TRACE_LEVEL > 1)
+          else {
+            CUTLASS_TRACE_HOST("GemmUniversal::run: cutlass::kernel_launch reports success");
+          }
+#endif
         }
         else {
+#if (CUTLASS_DEBUG_TRACE_LEVEL > 1)
+          CUTLASS_TRACE_HOST("GemmUniversal::run: Launching dynamic cluster kernel");
+#endif
           launch_result = ClusterLauncher::launch(
             grid, cluster, block, smem_size, stream, kernel, kernel_params, launch_with_pdl);
         }
@@ -405,17 +429,22 @@
     }
     else {
       launch_result = Status::kSuccess;
+      cutlass::arch::synclog_setup();
+
       if constexpr (kEnableCudaHostAdapter) {
         CUTLASS_ASSERT(cuda_adapter);
         if (cuda_adapter) {
           void* kernel_params[] = {&params};
-
+#if (CUTLASS_DEBUG_TRACE_LEVEL > 1)
+          CUTLASS_TRACE_HOST("GemmUniversal::run: Launching kernel with CUDA host adapter");
+#endif
           launch_result = cuda_adapter->launch(
             grid, block, smem_size, stream, kernel_params, 0
           );
         }
         else {
+          CUTLASS_TRACE_HOST("GemmUniversal::run: CUDA host adapter is null");
           return Status::kErrorInternal;
         }
       }
@@ -428,7 +457,7 @@
       using namespace syclcompat::experimental;
 #if defined (SYCL_INTEL_TARGET)
       auto event = launch<device_kernel<GemmKernel>>(launch_policy{
-        sycl_grid, sycl_block, local_mem_size{static_cast<std::size_t>(smem_size)},
+        sycl_grid, sycl_block, local_mem_size{static_cast<std::size_t>(smem_size)},
         kernel_properties{sycl_exp::sub_group_size<DispatchPolicy::SubgroupSize>}
       }, params);
 #else
@@ -438,13 +467,28 @@
 #endif
       EventManager::getInstance().addEvent(event);
 #else
-      device_kernel<GemmKernel><<<grid, block, smem_size, stream>>>(params);
+#if (CUTLASS_DEBUG_TRACE_LEVEL > 1)
+      CUTLASS_TRACE_HOST("GemmUniversal::run: Launching kernel with cutlass::kernel_launch");
+#endif
+      launch_result = cutlass::kernel_launch<GemmKernel>(
+        grid, block, smem_size, stream, params, launch_with_pdl);
+      if (launch_result != Status::kSuccess) {
+        CUTLASS_TRACE_HOST("GemmUniversal::run: cutlass::kernel_launch reports failure");
+      }
+#if (CUTLASS_DEBUG_TRACE_LEVEL > 1)
+      else {
+        CUTLASS_TRACE_HOST("GemmUniversal::run: cutlass::kernel_launch reports success");
+      }
+#endif
 #endif
     }
   }
 
   cudaError_t result = cudaGetLastError();
   if (cudaSuccess == result && Status::kSuccess == launch_result) {
+#if (CUTLASS_DEBUG_TRACE_LEVEL > 1)
+    CUTLASS_TRACE_HOST("GemmUniversal::run: cudaGetLastError reports success");
+#endif
     return Status::kSuccess;
   }
   else {
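[Editorial note, not part of the patch: a hedged sketch of how the PDL-aware
launch path above is reached from user code; the exact run() signature is an
assumption based on this adapter, not confirmed by the diff:

    using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
    Gemm gemm;
    gemm.initialize(arguments, workspace_ptr, stream);
    // launch_with_pdl = true requires SM90+, and routes through
    // cutlass::kernel_launch / ClusterLauncher with PDL enabled.
    gemm.run(stream, /*cuda_adapter=*/nullptr, /*launch_with_pdl=*/true);
]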
CUTLASS_TRACE_HOST("GemmUniversal::run: cudaGetLastError reports success"); +#endif return Status::kSuccess; } else { diff --git a/include/cutlass/gemm/device/gemm_universal_base.h b/include/cutlass/gemm/device/gemm_universal_base.h index 8e6604b595..8542357dbe 100644 --- a/include/cutlass/gemm/device/gemm_universal_base.h +++ b/include/cutlass/gemm/device/gemm_universal_base.h @@ -443,6 +443,8 @@ class GemmUniversalBase { "block: (" << block << "), " "SMEM: (" << kSharedStorageSize << ")"); + cutlass::arch::synclog_setup(); + if constexpr (kEnableCudaHostAdapter) { CUTLASS_ASSERT(cuda_adapter); if (cuda_adapter) { diff --git a/include/cutlass/gemm/device/gemv.h b/include/cutlass/gemm/device/gemv.h index 341124942a..5e181743ef 100644 --- a/include/cutlass/gemm/device/gemv.h +++ b/include/cutlass/gemm/device/gemv.h @@ -141,6 +141,7 @@ class Gemv { int smem_size = int(sizeof(typename GemvKernel::SharedStorage)); // Launch + cutlass::arch::synclog_setup(); cutlass::Kernel<<>>(params_); // diff --git a/include/cutlass/gemm/device/rank_2k.h b/include/cutlass/gemm/device/rank_2k.h index d12621e6b9..296f38cad2 100644 --- a/include/cutlass/gemm/device/rank_2k.h +++ b/include/cutlass/gemm/device/rank_2k.h @@ -319,6 +319,7 @@ class Rank2K { int smem_size = int(sizeof(typename Rank2Kkernel::SharedStorage)); + cutlass::arch::synclog_setup(); cutlass::Kernel<<>>(params_); cudaError_t result = cudaGetLastError(); diff --git a/include/cutlass/gemm/device/rank_k.h b/include/cutlass/gemm/device/rank_k.h index e6e9d025a4..ae18a11b80 100644 --- a/include/cutlass/gemm/device/rank_k.h +++ b/include/cutlass/gemm/device/rank_k.h @@ -296,6 +296,7 @@ class RankK { int smem_size = int(sizeof(typename RankKkernel::SharedStorage)); + cutlass::arch::synclog_setup(); cutlass::Kernel<<>>(params_); cudaError_t result = cudaGetLastError(); diff --git a/include/cutlass/gemm/device/symm.h b/include/cutlass/gemm/device/symm.h index 223e1b0d10..c36ef959b1 100755 --- a/include/cutlass/gemm/device/symm.h +++ b/include/cutlass/gemm/device/symm.h @@ -337,6 +337,7 @@ class Symm { } } + cutlass::arch::synclog_setup(); cutlass::Kernel<<>>(params_); cudaError_t result = cudaGetLastError(); diff --git a/include/cutlass/gemm/device/trmm.h b/include/cutlass/gemm/device/trmm.h index e354e7a132..09b9152cbb 100644 --- a/include/cutlass/gemm/device/trmm.h +++ b/include/cutlass/gemm/device/trmm.h @@ -495,6 +495,7 @@ class Trmm { } } + cutlass::arch::synclog_setup(); cutlass::Kernel<<>>(params_); cudaError_t result = cudaGetLastError(); diff --git a/include/cutlass/gemm/dispatch_policy.hpp b/include/cutlass/gemm/dispatch_policy.hpp index cf4fabc673..acc0961d64 100644 --- a/include/cutlass/gemm/dispatch_policy.hpp +++ b/include/cutlass/gemm/dispatch_policy.hpp @@ -34,7 +34,7 @@ #include "cutlass/gemm/gemm.h" #include "cute/layout.hpp" -#include "cute/numeric/integral_constant.hpp" +#include "cute/numeric/integral_constant.hpp" // cute::false_type ////////////////////////////////////////////////////////////////////////////// namespace cutlass::detail { @@ -48,6 +48,16 @@ struct is_kernel_tag_of, U> : cute::true_type {}; template class U> constexpr bool is_kernel_tag_of_v = is_kernel_tag_of::value; +template class U> +struct is_asymmetric_dma_kernel_tag_of : cute::false_type {}; + +template